Preface



This book is an extended collection of contributions that were originally submit- 
ted to the 1st International Workshop on Adaptive Multimedia Retrieval (AMR 
2003), which was organized as part of the 26th German Conference on Artificial
Intelligence (KI 2003), and held during September 15-18, 2003 at the University 
of Hamburg, Germany. Motivated by the overall success of the workshop, as
revealed by the stimulating atmosphere during the workshop and the number of
very interested and active participants, we decided to edit a book based
on revised papers that were initially submitted to the workshop. Furthermore,
we invited some more introductory contributions in order to be able to provide 
a conclusive book on current topics in the area of adaptive multimedia retrieval 
systems. We hope that we were able to put together a stimulating collection of 
articles for the interested reader. 

We would like to thank the organization committee of the 26th German Conference
on Artificial Intelligence (KI 2003) for providing the setting and the
administrative support in realizing this workshop as part of their program. In
particular, we would like to thank Christopher Habel for promoting the workshop
as part of the conference program and Andreas Gunther for his kind support
throughout the organization process. Last but not least, we would like to thank
all members of the program committee for their support in reviewing the
submitted contributions, the workshop participants for their willingness to
revise and extend their papers for this book, and Alfred Hofmann from
Springer-Verlag for his support in publishing this book.

Abstract. Multimedia information retrieval poses both technical and
design challenges beyond those of established text retrieval. These issues
extend to the entry of search requests, system interaction and the
browsing of retrieved content, and to the methodologies and techniques for
content indexing. Prototype multimedia information retrieval systems
are currently being developed which enable the exploration of both the
user interaction and technical issues. The suitability of the solutions
developed within these systems is currently being explored in the annual
TRECVID evaluation workshops, which enable researchers to test their
indexing and retrieval algorithms and complete systems on common tasks
and datasets.



1 Introduction 

The rapid expansion in the availability of online multimedia content has led to 
a similarly rapid growth in research into technologies for automated retrieval of 
multimedia information. The potential for exciting new multimedia applications 
targeted at operational environments ranges from entertainment and education 
to academic research and intelligence services. The possibilities of what these
systems might achieve are to a significant extent limited only by the imagination
of system developers. Systems for multimedia information retrieval are by their
nature complex, typically requiring the integration and adaptation of a number
of existing technologies as well as the development of novel algorithms and
techniques. Success in realizing these visions is limited not only by
the availability of the required technologies, but also to a considerable degree by
the quality of the analysis of user and system requirements and the consequent
design of the retrieval application.

The diversity of multimedia content types and the variety of environments for
which multimedia information retrieval systems might be developed mean that
there is no single ideal solution. It is thus vital when exploring the development
of a new multimedia information retrieval application to properly understand the
user and their need for the application (whether this is an existing need
or one created by the application), the capabilities of the hardware environment in
which the application must operate, and the available retrieval and information
management technologies.



A. Nürnberger and M. Detyniecki (Eds.): AMR 2003, LNCS 3094, pp. 1-18, 2004.

In order to better understand the importance of all these components of a
multimedia information retrieval system, this paper first explores the possible
definitions of multimedia information retrieval and the importance of adaptation
in these systems. It then briefly examines some important issues relating to both
multimedia systems and user tasks, and considers issues relating to retrieval and
content indexing. It goes on to demonstrate the application of some of these
features within the Fischlar system being developed in the Centre for Digital
Video Processing (CDVP) at Dublin City University, and outlines the
international TRECVID task for evaluating and understanding current
video retrieval technologies. The paper ends with some concluding thoughts on
future research directions.



2 What Is a Multimedia Information Retrieval System? 

One important question is, what constitutes a multimedia information retrieval 
system? It is often assumed that it must involve retrieval of full motion video, 
but it is sometimes referred to in the context of the retrieval of spoken documents 
without reference to associated video content or the retrieval of static images. 
In relation to video and image retrieval, the multimedia retrieval process itself
often involves only the analysis of linguistic material associated with the visual
content: the spoken soundtrack in the case of video, and textual labels in the
case of images. Where visual data is present, it is natural to think in terms of
analysis of this content and its use in the retrieval process. However, as we will
see later, this is much less straightforward than might at first be assumed.

Another significant issue in respect of the definition of multimedia retrieval 
is to consider what the system must actually be capable of doing. Established 
text information retrieval systems, as exemplified by web search engines, use a 
user search request to compute a ranked list of potentially relevant documents 
which is returned to the user with a short piece of text from each document 
that, it is hoped, reliably indicates the main reason why this document has been
adjudged to be potentially relevant. The user then selects the document that they
feel is most likely to satisfy their information need, and downloads the document
to read it and extract the necessary information from it. It may seem obvious,
but in the context of extending this paradigm to multimedia data it is important to
appreciate that the user currently addresses their information need by reading
the whole text document, with perhaps a small amount of keyword searching 
if it is a long document. One possibility for a multimedia information retrieval 
system is merely to replicate this text searching environment requiring the user to 
audition retrieved documents to find the relevant information. However, several 
features of multimedia data mean that these systems must be both more complex 
and include the user in a much more integrated way. 

Firstly, in the case of temporal media such as video and spoken data, browsing 
large amounts of material is very time consuming since the user must listen 
to it. Playback can be faster than real-time, but even doubling the speed of 
delivery does not begin to address this problem. Secondly, when considering visual
media there is the fundamental question of how the user should express their
information need. In the case of linguistic data it is natural to assume that a 
written search request is an appropriate form of expression, although this may 
not always be the case, for example if the user is uncertain about domain specific 
terminology. Browsing of temporal multimedia documents is addressed in most 
video and audio retrieval systems by the development of a graphical browsing 
application which aims to direct users to potentially relevant sections of the 
document without having to play back the document in its entirety from the 
beginning. These browsers typically show time on a horizontal graphical bar 
representing a complete retrieval unit, e.g. a document, with potential points of 
interest marked along the bar [1] [2]. Further indication of the content can be
given in the case of video content by the use of a series of keyframes taken from
the video, which are shown in a single screen. Typically, clicking any selected point
on the document bar will begin playback from that position in the document.

How though should the user express a need for information contained in 
non-linguistic form for visual content searching? Perhaps they can express their 
request in text which is then matched against automatically generated labels 
of the visual content. Or perhaps they can sketch what they are looking for
so that an image search can be performed. Or if they happen to have an existing image
example, this could be used in a “query by example” framework to find similar
images. These approaches make various assumptions about the sophistication of
automatic content indexing, the user’s ability to express what they are looking
for, or the availability of existing exemplars of what they are looking for. As we
will see later, feature extraction is one of the most significant challenges facing
visual-based multimedia information retrieval, and probably one of the reasons
that many such systems rely heavily on the use of linguistic content.
One of the weaknesses of this dependence on linguistic content is that spoken 
soundtracks and textual image labels will in general only express a very limited 
interpretation of the visual content, and in some cases will bear no relationship 
to the visual content at all. Thus there is a very real need for the development 
of visual indexing technologies. 

The difficulty in expressing information needs for visual content and the 
limitations in visual indexing mean that the searching process will often need 
to be much more interactive with the user involved in multiple cycles of query 
refinement to actually find what they are looking for. For image retrieval this will 
typically involve a combination of searching on textual labels and then refinement 
by selecting images that are related to the desired image using query by example 
feedback cycles. For video retrieval a similar process will be carried out using 
content from the spoken soundtrack and feedback using keyframes from the video 
and potentially complete scenes. The limitations and unreliability of video feature
extraction and the usual importance of any associated linguistic content mean
that retrieval decisions should generally be based on a combination of matching
scores derived from multiple media streams.
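
As a rough illustration of combining matching scores from several media streams, the following sketch normalizes each stream's scores and forms a weighted sum. The stream names, the weights and the choice of min-max normalization are assumptions made for this example only; they are not a method prescribed in this paper.

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally matched.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(stream_scores, weights):
    """Weighted sum of normalized per-stream scores; a missing score counts as 0."""
    fused = {}
    for stream, scores in stream_scores.items():
        w = weights.get(stream, 0.0)
        for doc, s in normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    # Highest combined score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three video shots from two media streams.
stream_scores = {
    "speech_transcript": {"shot1": 2.0, "shot2": 8.0, "shot3": 5.0},
    "keyframe_colour": {"shot1": 0.9, "shot2": 0.1, "shot3": 0.5},
}
weights = {"speech_transcript": 0.6, "keyframe_colour": 0.4}
ranking = fuse(stream_scores, weights)
```

In practice the weights would need to be tuned for the collection and task, since the reliability of each stream varies.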

The complexity of the retrieval and browsing phases of multimedia information
retrieval makes it attractive to make use of any additional information that
might be available. Thus the system should make use of explicit user feedback
from the current search, e.g. “this document is relevant” or “this person may be
important in relevant documents”; implicit feedback, e.g. playback of an entire
document often suggests that it is relevant or at least partially relevant; and the
user’s previous searching history. Feedback methods are considered further in
Section 5.1.

3 Multimedia Systems 

The broadest definition of a multimedia system usually involves the potential 
to deliver visual and audio content to a user as required. This may mean de- 
livery from a local source such as a DVD or CD-ROM, or playback across a 
network which may itself only be a local area network or a much larger wide 
area network. The fidelity of the content that can be delivered will depend on 
both the computational resources available at each point in the network and also 
the bandwidth of the network itself. 

Until fairly recently a multimedia system would involve a high-power
computer connected to a hard-wired network. However, this situation is rapidly
evolving to include broadband wireless networks and the capability of
multimedia processing on handheld computing devices such as PDAs and mobile
telephones. The various networking technologies involved in connecting these
devices to the network have different bandwidths and latency specifications, the
computing devices themselves have varying resources for data processing and
differing physical resources for information delivery, and, importantly, the users
of these different platforms are working in a variety of different environments.

All these issues taken together mean that multimedia information retrieval 
applications must be appropriate to the network, the hardware being used and 
the user’s physical environment. Thus the applications should adapt to the multi- 
media system being used. For example, the fidelity of the content delivery should 
not exceed the capacity of the network or the computing device, and the user 
interface to the system should take into account the physical dimensions of the 
computing device enabling the user to view the output easily on small devices 
while not restricting the possibility for complex interaction and visualization on 
desktop systems. 
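
The requirement that delivery fidelity should not exceed the capacity of the network or device can be sketched as a simple selection rule. The encodings, bitrates and frame widths below are hypothetical values chosen for illustration, and a real system would consider further factors such as latency and decoding cost.

```python
from dataclasses import dataclass


@dataclass
class Encoding:
    name: str
    bitrate_kbps: int  # bandwidth needed to stream this version
    width: int         # frame width in pixels


# Hypothetical encodings of the same video, lowest fidelity first.
ENCODINGS = [
    Encoding("low", 20, 176),
    Encoding("medium", 300, 352),
    Encoding("high", 1500, 720),
]


def choose_encoding(network_kbps, screen_width, encodings=ENCODINGS):
    """Return the highest-bitrate encoding that fits both network and screen.

    Falls back to the lowest-bitrate encoding when nothing fits.
    """
    usable = [
        e for e in encodings
        if e.bitrate_kbps <= network_kbps and e.width <= screen_width
    ]
    if not usable:
        return min(encodings, key=lambda e: e.bitrate_kbps)
    return max(usable, key=lambda e: e.bitrate_kbps)
```

For example, `choose_encoding(40, 240)` selects the "low" version for a handheld device on a slow link, while a desktop on a fast network would receive the "high" version.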

4 User Tasks 

The specifications of the multimedia platforms and networks, and the available
indexing and retrieval technologies, only provide the potential to develop
effective multimedia information retrieval applications. It is vital in attempting to
specify useful applications that developers analyze the needs and potential needs 
of the users of these applications. 

It is often argued in respect of user interface design for computing applica- 
tions that these should be based on a careful analysis of the tasks that users will 
really wish to carry out, and that this should include concepts and vocabulary 
with which the target user group are already familiar. This is often referred to 
as adopting a strategy of user-centered design.
While this is certainly true of multimedia information retrieval systems, users
will often not be familiar with applications of the type that we are trying
to develop, so it seems inevitable that new concepts will be introduced that users
will not be familiar with. In this case it is important that these novel concepts
are ones which build on those with which the user is already familiar. It is
often tempting to develop applications for developers rather than for
real users. I suspect that this is particularly true of multimedia information
retrieval applications and I would caution developers to always bear this point
in mind. For example, while it may seem attractive to develop interfaces which
adapt automatically using machine learning techniques based on input from user
behaviour, these modified interfaces are unlikely to find favour with users if they
cannot work out how to perform operations because basic interface consistency
principles are being broken in the adaptation process. Much guidance on these
issues is available in the user interface design literature [3].

5 Information Retrieval 

A full description of information retrieval methods is beyond the scope of this
paper; this section highlights some relevant features of text retrieval methods
that can be applicable to multimedia applications.

Text retrieval systems are usually based on computing a matching score
between some form of textual search request and each available document in an
archive. A list of documents ranked by matching score is then returned to the
user. A number of elaborations on this approach are available to
improve performance or adapt the method to different tasks. These include
relevance feedback methods, personalization, and recommender systems and
collaborative filtering.
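
The basic ranking step described above can be sketched as follows. The tf-idf style weighting, the whitespace tokenizer and the example documents are assumptions chosen for brevity; operational systems use more refined matching functions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query, documents):
    """Rank documents by a simple tf-idf matching score against the query."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in doc_terms.values() for term in terms)
    scores = {}
    for doc_id, terms in doc_terms.items():
        score = 0.0
        for term in tokenize(query):
            if term in terms:
                # Smoothed inverse document frequency (one illustrative choice).
                idf = math.log(1 + n_docs / df[term])
                score += terms[term] * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical three-document archive.
documents = {
    "d1": "election results announced in dublin",
    "d2": "football results from the weekend",
    "d3": "weather forecast for dublin",
}
ranking = rank("dublin election", documents)
```

Here the document matching both query terms is ranked first, followed by the document matching only the more common term.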



5.1 Relevance Feedback 

Relevance feedback methods provide a number of possible techniques to adapt 
user search requests. The input to the relevance feedback process is the existing 
search request, the set of documents returned in response to this query, and 
the user’s judgements of the relevance of these retrieved documents. The output 
is typically a ranked set of possible expansion terms that may be added to 
the existing request and information to modify some of the parameters of the 
search system to enhance the ranking of documents similar to those marked 
relevant in the current search. It is hoped that adding the proposed additional
terms to the request will make it a better expression of the user’s information
need, and that modifying the search parameters will promote the rank of further
relevant documents. The basic underlying assumption is that further relevant
documents will in some way resemble those already identified.

The search request can be expanded automatically to include the highest
ranked of the proposed terms, or the terms can be offered to the user so that they can
select those which they feel best reflect concepts related to their information
need [4]. The expanded query statement is then applied to the search archive
and a new ranked list retrieved. The dominant effect in relevance feedback
usually relates to query expansion, but its effectiveness is usually enhanced by its
combination with modification of search term weights to favour terms associated
with relevant documents.
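
One way to picture a feedback cycle that both expands the query and reweights its terms is a Rocchio-style update. This is a classical technique that is not named in this paper and is used here purely as an illustration; the parameter values `alpha`, `beta` and `gamma` and the example texts are likewise illustrative assumptions.

```python
from collections import Counter


def rocchio(query_weights, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15, n_expand=2):
    """Return an updated {term: weight} query after one feedback cycle."""
    def centroid(docs):
        # Average term frequency over a set of judged documents.
        total = Counter()
        for text in docs:
            total.update(text.lower().split())
        return {t: c / len(docs) for t, c in total.items()} if docs else {}

    rel, nonrel = centroid(relevant_docs), centroid(nonrelevant_docs)
    terms = set(query_weights) | set(rel) | set(nonrel)
    updated = {
        t: alpha * query_weights.get(t, 0.0)
        + beta * rel.get(t, 0.0)
        - gamma * nonrel.get(t, 0.0)
        for t in terms
    }
    # Keep all original query terms, plus the best-scoring new terms.
    new_terms = sorted(
        (t for t in updated if t not in query_weights and updated[t] > 0),
        key=lambda t: (-updated[t], t),
    )[:n_expand]
    keep = set(query_weights) | set(new_terms)
    return {t: w for t, w in updated.items() if t in keep}


new_query = rocchio(
    {"jaguar": 1.0},
    relevant_docs=["jaguar car engine", "jaguar car dealer"],
    nonrelevant_docs=["jaguar cat jungle"],
)
```

In this toy example the query gains the terms "car" and "dealer", while terms seen only in the non-relevant document are not added.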

An alternative to interactive relevance feedback is pseudo relevance feedback,
where a number of the top ranked documents in the initial retrieval pass are
assumed to be relevant. The expansion terms and revised search term weights are
computed as before, but using this assumed relevant document set, and another
retrieval pass is performed with the revised topic statement before presenting the
revised ranked list to the user. Of course, some of the documents assumed to
be relevant will not in fact be relevant; this can lead to selection of some poor
expansion terms which can actually reduce performance for the second retrieval
pass. On average the effect is generally observed to be beneficial to retrieval
accuracy, but it can be disastrous for individual queries, particularly if none of
the assumed documents are in fact relevant. True relevance feedback based on
users’ relevance judgements will in general be better.
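
The two-pass procedure can be sketched in a few lines. The term-count matching score, the frequency-based choice of expansion terms and the example archive are simplifying assumptions for this illustration, not the weighting schemes used in the cited work.

```python
from collections import Counter


def score(query_terms, text):
    """Count how many times the query terms occur in the document."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)


def search(query_terms, documents):
    """Return document ids ordered by decreasing score (ties broken by id)."""
    return sorted(documents, key=lambda d: (-score(query_terms, documents[d]), d))


def pseudo_relevance_feedback(query, documents, top_k=2, n_expand=2):
    """Expand the query from the top_k documents, then search again."""
    query_terms = query.lower().split()
    first_pass = search(query_terms, documents)
    assumed_relevant = first_pass[:top_k]  # no user judgements are consulted
    candidates = Counter()
    for doc_id in assumed_relevant:
        candidates.update(
            t for t in documents[doc_id].lower().split() if t not in query_terms
        )
    ranked_terms = sorted(candidates, key=lambda t: (-candidates[t], t))
    expanded_query = query_terms + ranked_terms[:n_expand]
    return expanded_query, search(expanded_query, documents)


# Hypothetical archive: d3 is on topic but does not contain the query word.
documents = {
    "d1": "storm damage flooding coast",
    "d2": "storm flooding rivers rain",
    "d3": "flooding rivers burst banks",
    "d4": "football match result",
}
expanded, ranking = pseudo_relevance_feedback("storm", documents)
```

The expansion term "flooding" gives `d3` a non-zero score in the second pass, which shows the intended benefit; if the top documents had been off topic, the added terms would have pulled the ranking in the wrong direction.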



5.2 Personalization 

Relevance feedback usually refers only to the adaptation of retrieval system
parameters and the query for a single ad hoc request. A more elaborate use
of feedback information is to provide a personalization of the retrieval system 
to the individual user. Where this is done the retrieval system will adapt to the 
behaviour of individual users possibly over a single searching session or over an 
extended period of time, or a combination of both, and its response to a search 
request will be different for each user. 

The basic process of personalization is to use previous relevance judgements
to develop one or more profiles associated with each user that represent their
ongoing interests, e.g. particular sports teams or news topics. Profiles are typically
a set of keywords which may be weighted based on their perceived importance in 
expressing the user’s interests. A variety of methods are possible to form these 
profiles and utilize them in searching. 
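
A weighted keyword profile of this kind might be maintained and applied as in the sketch below. The class name `UserProfile`, the decay factor and the re-ranking rule are hypothetical choices made for illustration; many other ways of forming and using profiles are possible.

```python
from collections import Counter


class UserProfile:
    """Weighted keyword profile built from documents a user judged relevant."""

    def __init__(self, decay=0.9):
        self.weights = Counter()
        self.decay = decay  # older interests fade with each update

    def update(self, relevant_text):
        """Age the existing weights, then add the terms of a relevant document."""
        for term in self.weights:
            self.weights[term] *= self.decay
        self.weights.update(relevant_text.lower().split())

    def score(self, text):
        """Sum of profile weights of the distinct terms in the document."""
        return sum(self.weights[t] for t in set(text.lower().split()))

    def rerank(self, doc_ids, documents):
        """Reorder a result list so documents matching the profile come first."""
        return sorted(doc_ids, key=lambda d: -self.score(documents[d]))


profile = UserProfile()
profile.update("liverpool football match")
profile.update("football transfer news")
documents = {"a": "budget news", "b": "football cup final", "c": "weather"}
personalized = profile.rerank(["a", "b", "c"], documents)
```

Because the sort is stable, documents with equal profile scores keep the order given by the underlying retrieval system.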



Personalization Agents. One approach to personalization of retrieval systems 
is to make use of agents to model user interests. One example of such a system 
is Amalthaea, developed at the MIT Media Laboratory [5]. This system is based
on an ecosystem of evolving agents which represent user interests. Agents are 
rewarded for delivering relevant documents to the user and the best agents re- 
produce by using the genetic methods of mutation and crossover. The lowest 
scoring agents are purged from the system with the aim of maintaining a gene 
pool of consistent size which best models the user’s current interests. Experi- 
mental studies show that Amalthaea is able to rapidly adapt to changes in the 
user’s interests. 
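
The evolutionary idea can be illustrated with a toy model in which each agent is simply a set of keywords. This is only a sketch of the general mechanism of reward, purging, crossover and mutation; the representation, fitness function and parameters are assumptions and do not reproduce the actual design of Amalthaea [5].

```python
import random


def fitness(agent, liked_terms):
    """Reward an agent for each keyword that matches the user's feedback."""
    return len(agent & liked_terms)


def crossover(a, b, rng):
    """Child takes each keyword of either parent with probability 0.5."""
    return frozenset(t for t in a | b if rng.random() < 0.5)


def mutate(agent, vocabulary, rng, rate=0.2):
    """Occasionally add one random vocabulary term to the agent."""
    if rng.random() < rate:
        return agent | {rng.choice(vocabulary)}
    return agent


def evolve(population, liked_terms, vocabulary, rng):
    """One generation: purge the weakest half, refill by breeding survivors.

    Needs at least four agents so that two survivors are available as parents.
    """
    ranked = sorted(population, key=lambda a: fitness(a, liked_terms), reverse=True)
    survivors = ranked[: len(ranked) // 2]
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        children.append(mutate(crossover(a, b, rng), vocabulary, rng))
    return survivors + children
```

The population size stays constant from one generation to the next, and because the fittest agents always survive, the best fitness in the population never decreases while the user's interests stay fixed.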

5.3 Recommender Systems and Collaborative Filtering

Personalization based on the behaviour of individual users relies on the limited 
amount of information that can be gathered based on their actions. An alter- 
native is to gather information from a number of equivalent users and combine 
this information to represent their shared interests. These group profiles are thus
based on a broad set of user experiences and in general more relevance
data. This data can be used to recommend potentially relevant material and also
within interactive retrieval [6].

Individual and group profiles can be combined to give personal adaptive 
profiles with contribution from group experiences. 
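
A minimal sketch of such a combination, assuming that profiles are stored as term-to-weight mappings and that a simple linear interpolation is acceptable (both are assumptions made for this example):

```python
def group_profile(profiles):
    """Average the term weights of several users' profiles."""
    terms = set().union(*profiles) if profiles else set()
    return {t: sum(p.get(t, 0.0) for p in profiles) / len(profiles) for t in terms}


def combine(individual, group, mix=0.7):
    """Interpolate: mix * individual weight + (1 - mix) * group weight."""
    terms = set(individual) | set(group)
    return {
        t: mix * individual.get(t, 0.0) + (1 - mix) * group.get(t, 0.0)
        for t in terms
    }


# Hypothetical profiles for two users with overlapping interests.
alice = {"football": 1.0, "politics": 0.5}
bob = {"football": 0.5, "film": 1.0}
shared = group_profile([alice, bob])
alice_adapted = combine(alice, shared, mix=0.5)
```

The value of `mix` controls how strongly the individual's own history dominates the group experience; after combination, Alice's profile also carries some weight for "film", a term contributed only by the group.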



6 Multimedia Information Retrieval 

The previous section introduced some adaptation methods used in text retrieval 
systems. While these methods are all currently used in many prototype systems, 
they remain the subject of active research interest. This section looks at existing 
multimedia information retrieval and considers how adaptation techniques have 
been applied to date and how they might be further extended. 



6.1 Spoken Document Retrieval 

The most mature area of multimedia information retrieval relates to spoken 
documents. If the contents of spoken documents were fully manually transcribed,
the retrieval stage would be a standard text retrieval problem. However,
manual transcription of more than a trivial amount of spoken content is generally
prohibitively expensive (domains such as mass media broadcast TV or film are 
a notable exception) and spoken document retrieval systems thus usually rely 
on transcriptions generated by automatic speech recognition systems. Various 
approaches to speech recognition for indexing spoken documents for retrieval 
have been explored, but comparative experiments have demonstrated that for- 
mation of a full transcription using large vocabulary recognition gives the best 
output for retrieval purposes [7], at least in the domain explored of TV news. 
It is not clear whether this is the optimal indexing solution for less structured 
data with a vocabulary less well matched to the document domain. As with all 
speech recognition systems, transcription systems make errors in their output. 
The number of errors is related to the quality and content of the audio signal 
and typically varies from around 5% to over 80% with an average using current 
systems of around 20%. It has been found in experiments that this level of errors 
in the transcription has only a very small impact on retrieval accuracy [7]. 
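
The text does not state how transcription errors are counted; assuming the usual word error rate (word-level edit distance divided by the length of the reference transcript), the percentages above could be computed as in this sketch.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (r != h)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For instance, one substituted word in a six-word reference gives a rate of 1/6, or roughly 17%. Because insertions are counted, the rate can exceed 100% for very poor transcriptions.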

For some data sources, such as TV broadcasts, textual closed captions or 
subtitles of the audio are broadcast with the audio-video material. While often 
not a perfect transcription of the audio material, the quality is usually better 
than that generated automatically using speech recognition. The closed captions
can be decoded into standard text and used as the search index data. The closed
captioning is usually not closely aligned to the actual audio data. However, a
forced alignment speech recognition phase can be used to align the audio content
to the closed captions, enabling fine granularity searching.

Where the number of indexing errors is sufficiently high to impact retrieval
performance, it is generally found that pseudo relevance feedback methods are
particularly effective for improving spoken document retrieval [8].

6.2 Image and Video Retrieval 

Retrieving multimedia content using information extracted from visual media is
much more challenging, both technically in terms of feature extraction and from
the perspective of user interaction. Images frequently have many interpretations.
Some of these can be measured directly from visual features, but often the
intended interpretation will depend on the context in which the image is being
viewed. For example, an object may be identified as a building, as a cathedral,
as a specific named cathedral, as being of a particular style and period of
architecture, or by the religious denomination to which it belongs. Some of these
interpretations can be made directly from analysis of the image using suitable
templates; others would require additional information sources to be consulted.
Even the image-only interpretation requires the image to be indexed
using appropriate features.

In principle if a standard set of features could be agreed, and it is not at 
all clear that this could be possible, then all video content could be manually 
annotated. However, the cost of doing this would be uneconomic for all but the 
most important data. Thus automated feature indexing is likely to be even more 
important than for spoken content indexing. 

Automatic feature extraction for image and video data is currently the focus 
of a large research effort, but so far the achievements remain very limited. For
image retrieval, indexing is often based on extraction of colour histograms with a
limited amount of spatial information included. Much research is also exploring 
specific feature extraction tools, often relating to specific domains, for example 
identification of named people or people in general, cars, sky, etc. Video analysis 
often includes structural indexing such as the detection of shot boundaries, and 
attempting to identify keyframes from within identified shots. The same colour 
histogram and feature extraction techniques are typically applied to individual 
frames. Ideally features should be derived automatically, robust, accurate and 
above all useful for retrieval. It would be nice to build feature detectors for each 
query as it is entered, but this is not practical and retrieval must make use of 
the feature analysis carried out when the data was initially indexed. Systems 
are typically configured to only attempt to recognise the presence of a very 
limited number of features. The limited number of features and the difficulty
in defining features that are generic mean that image analysis systems are
domain specific. This may be a very tightly specified domain, e.g. recognising
the presence of a moving car in a video, or broader (but nevertheless limited to
a specific task), e.g. retrieving images from a collection of disjoint photographs 
using matching of colour regions. 
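
Colour histogram indexing and matching can be sketched as follows. The number of bins and the use of histogram intersection as the similarity measure are assumptions for this illustration, and no spatial information is included.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Quantize (r, g, b) pixels (0-255 each) into a normalized joint histogram.

    bins_per_channel must divide 256 evenly.
    """
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1
    total = len(pixels)
    return [count / total for count in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Tiny synthetic "images" given as lists of pixels.
red = colour_histogram([(255, 0, 0)] * 4)
blue = colour_histogram([(0, 0, 255)] * 4)
mixed = colour_histogram([(255, 0, 0)] * 2 + [(0, 0, 255)] * 2)
```

An all-red image has no overlap with an all-blue one, but shares half of its colour mass with an image that is half red and half blue. The same computation applies to individual video frames or keyframes.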




Adaptive Systems for Multimedia Information Retrieval 






The matching of identified visual features and search queries is so far fairly
simple by comparison with text retrieval, for which a number of theoretically
motivated models have been developed.

One of the important areas for multimedia information retrieval is the
integration of data from the multiple media streams, e.g. audio, visual and textual.
Evidence from each media stream can be used to reinforce the others for more
accurate feature identification and retrieval. For example, a named individual may be
recognised by uncertain evidence from visual recognition, speaker identification
from the audio stream, and the name appearing in accompanying textual
information. If all these sources indicate the presence of a named individual, there is
a very good chance that this person is indeed present in this shot. Research in
this area is again at an early stage, but encouraging results are being achieved
using techniques such as Support Vector Machines [9].
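
To see why several uncertain detections reinforce one another, consider a deliberately simple noisy-OR combination, which treats each stream's confidence as an independent chance of a correct detection. This is not the Support Vector Machine approach of [9]; the independence assumption and the confidence values are illustrative only.

```python
def noisy_or(confidences):
    """Combined confidence that a person is present, assuming independent streams."""
    miss = 1.0
    for p in confidences:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each confidence must lie in [0, 1]")
        # Probability that every stream considered so far is wrong.
        miss *= 1.0 - p
    return 1.0 - miss


# Hypothetical confidences: face recognition, speaker identification, caption text.
combined = noisy_or([0.6, 0.5, 0.7])
```

With these values the combined confidence is about 0.94, higher than that of any single stream. In real data the streams are correlated, which is one reason learned combination methods are preferred.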

Moving beyond recognition of objects in narrow domains to identification, 
tracking and interpretation of objects within complex video is a long term re- 
search goal of those working in this area. 



Relevance Feedback. Difficulties in defining search requests and indexing 
mean that user feedback in an iterative searching process is an important topic 
for multimedia information retrieval. Feedback can be used for adaptation in the 
search, both by applying it to the linguistic data, as is already widely exploited, 
but also to the image data. An example of relevance feedback for image retrieval 
is described in [10]. 

An important observation with respect to relevance assessment is that
judging relevance of textual documents takes a small, but potentially significant
amount of time, particularly if a large number of documents must be judged.
However, user assessment of the relevance of an image is almost instantaneous.
Thus, while it is unclear how to specify a search request properly representative
of the user’s information need for visual search, it is comparatively easy to obtain
large amounts of relevance data from the user in response to initial search runs.

This suggests that there is a much greater need for the “user-in-the-loop” 
for searching of visual media. Thus while fully automated searching may not be 
effective due to the ambiguity in image interpretation, much relevance data can 
be collected for each search to better understand the user’s need in this context
[11].

Given that the purpose of feedback here is not so much to “find more like
this”, as is often assumed to be the case for text relevance feedback, but rather
to learn more about what is required in a relevant image, it makes sense to
show the user the most-informative images for feedback, which may not coincide
with the most-positive images for an individual search. This can be thought
of as the difference between a “show-me-the-results” type display and an
“ask-me-questions” user model [11].
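
The contrast between the two display strategies can be made concrete with a small sketch. Here "most informative" is interpreted as "estimated relevance closest to 0.5", which is an assumption made for this example; the criterion used in [11] may differ, and the scores are hypothetical.

```python
def most_positive(scores, k):
    """Images the system currently believes are most likely relevant."""
    return sorted(scores, key=lambda img: (-scores[img], img))[:k]


def most_informative(scores, k):
    """Images whose estimated relevance is closest to 0.5 (most uncertain)."""
    return sorted(scores, key=lambda img: (abs(scores[img] - 0.5), img))[:k]


# Hypothetical estimated relevance probabilities for five images.
scores = {"img1": 0.95, "img2": 0.52, "img3": 0.10, "img4": 0.45, "img5": 0.80}
show_results = most_positive(scores, 2)
ask_questions = most_informative(scores, 2)
```

The first list contains the images the system is already confident about, whereas the second contains those for which a user judgement would tell the system most.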

The difficulties of specifying information need and the ease with which users
can make relevance judgements of images are two of the motivating factors in
the proposed use of the ostensive model of relevance feedback for image search
as described in [12]. The ostensive model supports a query-less interface in
which the user’s indication of the relevance of the retrieved objects is used as
the indication of the user’s current information need. It therefore allows direct
searching without the need to formally specify the information need.

Image Clustering. It is sometimes argued that images can be clustered to
assist browsing of images similar to one identified as relevant. However,
semantically meaningful clustering depends on the subspace in which a semantic
concept class lies. In the case of images, semantically meaningful classes will depend
largely on the user’s current interpretation of the image relevance. It is this
information that relevance feedback is trying to capture, so prior clustering of
images may often be of limited utility for image searching.

An illustration of the issue of preclustering is provided by [13]. Images are
clustered based on colour features, and manual examination of the clusters quickly
reveals inappropriate groupings. However, moving images between clusters to
correct these mistakes and reclustering incorporating this feedback information 
leads to more reliable clusters. This provides an example of the importance and 
effectiveness of involving the human user in the management of image data. 

A number of systems are currently in development to explore issues in
multimedia information retrieval and to investigate the development of techniques
to improve multimedia information retrieval performance. One such system is
the Fischlar-News video system being developed within the Centre for Digital Video
Processing at Dublin City University (DCU). The next section outlines the current
Fischlar prototype system, highlighting its current use of adaptation, and the
following one introduces the TRECVID international evaluation exercise and
discusses the application of Fischlar to this task.

7 Fischlar-News 

Fischlar is a digital video archive system based on MPEG-7 digital video content 
management and retrieval, and supports playback using both fixed and mobile 
computing devices. Fischlar is deployed across the university campus at DCU 
and currently has more than 1000 regular users. 

Fischlar-News automatically records the 30-minute 9.00pm news every day
from the Irish national broadcaster RTE1. The system is accessible on campus
via any web browser [14] and is now being made available on mobile devices [15]. 
Currently several months of recorded news is available online with more than 2 
years of material in the overall archive. 

In order to support access from different platforms, Fischlar-News is based
on XML technologies which, by incorporating an XSL transformation for each new
device required, can easily be extended to incorporate new access technologies,
devices and standards. Figure 1 shows the basic architecture of Fischlar-News 
illustrating both desktop and mobile access and the process of automatic news 
story segmentation.

Fig. 1. Architecture of Fischlar News.



In Fischlar-News mobile access is supported for both PDAs (Compaq iPAQ
on a wireless LAN) and XDAs. Each of these plays RealVideo content encoded
at 20 Kbps in order to support streaming access over a mobile phone network
on an XDA. The desktop version of the system uses a conventional web browser
with MPEG-1 video streaming.

Fischlar currently performs shot boundary detection on the captured data 
and identifies scenes and keyframes. One area of current work is on the auto- 
mated segmentation of news broadcasts into story level units which typically 
combine a number of separate shots [9]. Although work is progressing well on 
this, the current prototype system relies on manual segmentation of each broad- 
cast into story units. 

7.1 Fischlar-News on the Desktop 

The basic level of browsing on desktop Fischlar-News is the news broadcast. The 
basic display is shown in Figure 2. Selecting a broadcast from the calendar on
the left hand side displays a list of news stories from this broadcast. Each news
story within the broadcast is represented by a keyframe and textual description 
of the story. 

When presented with a list of news stories the user has the option to either 
play back the news story by clicking on the “PLAY THIS STORY” button 
(Figure 3) or to examine the story at the shot level by selecting the keyframe 
or the numbered news story title. This produces a detailed list of camera shots 
and associated closed caption text (Figure 4). 

The size of the broadcast archive means that content-based searching is vital
in Fischlar News. Each news story is represented by a textual description
automatically extracted from the closed caption text broadcast along with the audio
and video data. This textual description is used for story-level searching of the
archive. The search query is entered in the query box at the top left hand corner
of the interface.

Fig. 2. Fischlar News with stories from one program.

A ranked list of stories is returned and displayed in the story column on 
the right hand side of the interface. Individual stories can be played back and 
browsed at the shot level as before. In this case, the story can additionally
be viewed in the context of that day’s news broadcast by following the date
link, which displays a listing of news stories from the news program for that day.

Using the closed caption transcripts, similar stories are identified and links
are formed between the related stories. Fischlar-News generates a ranked list of
the ten most similar news stories. These related stories can be shown in a ranked
story list on the right hand side of the display and browsed as for the other
options.
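
The text does not say how story similarity is computed from the transcripts. As a minimal sketch, assuming a plain TF-IDF vector space model with cosine similarity (an assumption, not the documented Fischlar-News method), a ranked list of the ten most similar stories could be produced as follows. The function names and the whitespace tokenisation are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(stories):
    """Map each story id to a sparse TF-IDF vector built from its transcript."""
    tokens = {sid: text.lower().split() for sid, text in stories.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n_stories = len(stories)
    return {
        sid: {term: tf * math.log(n_stories / doc_freq[term])
              for term, tf in Counter(words).items()}
        for sid, words in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_stories(story_id, stories, top_n=10):
    """Return the ids of the top_n stories most similar to story_id."""
    vectors = tfidf_vectors(stories)
    target = vectors[story_id]
    scored = [(cosine(target, vec), sid)
              for sid, vec in vectors.items() if sid != story_id]
    # Highest similarity first; ties are broken by story id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sid for _, sid in scored[:top_n]]
```

Calling `related_stories("s1", stories)` with a dictionary that maps story identifiers to closed caption text returns at most ten identifiers, best match first.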

In order to provide personalization and recommendations, user feedback is 
collected. At any point while browsing the archive or an individual story the 
user is presented with the opportunity to rate the story on a five point scale 
from “do not like” to “like very much”. In addition to gathering explicit user 
feedback, usage data is collected automatically as the user plays back or browses 
news stories. Recommendations from users are used as one of the primary access 
mechanisms for the mobile version of the Fischlar-News system. 

7.2 Fischlar-News on Mobile Devices 

The small display size and the difficulties of data input for mobile devices, as well
as the observation that users are often engaged in distracting environments while
using these devices, present major constraints in the design of mobile applications.
In consequence it has been suggested that different interaction paradigms
are suitable for mobile devices, rather than just porting the interaction methods
from the desktop environment. Based on various user studies some general design
principles for interaction with mobile devices have been proposed. These include
principles such as minimizing the required user input, e.g. by providing yes/no
options and hyperlinks, and filtering the available information to deliver only
content that is most likely to be relevant to this user.

Fig. 3. Playing back a news story.

These guidelines suggest that more pre-processing of the information should
be carried out by the system prior to delivery to the user. For example, the use
of recommender technologies can be increased so that material is delivered with
less user interaction. This is particularly important for multimedia information retrieval
where browsing is such an important issue in information access. The Fischlar- 
News mobile application uses the personalized list of news stories as the primary 
access point for mobile users [9]. 

The starting point for access on the mobile Fischlar-News system is the per- 
sonalized story list shown in Figure 5. The only input that is required from the 
user is to select a news story to play back using RealVideo, shown in Figure 6. 
This approach minimizes the user input by filtering out content that this user 
may not be interested in. 

As an alternative to the personalized list, the user can be presented with a 
reverse chronological listing of recorded news programmes, shown in Figure 7. 
This enables them to view the entire archive. When a broadcast is selected it is 
presented to the user as a listing of composite news stories, as shown in Figure 8.

Fig. 4. Shot-level browsing of a news story.



8 TRECVID — Benchmarking Video Retrieval 



TREC (Text REtrieval Conference) is an annual research exercise organised by 
the US NIST which enables groups to perform comparative experiments on in- 
formation retrieval tasks. TREC culminates in a workshop where participants 
gather to report their results and the methods used to achieve them. TREC, now 
in its 12th year, has explored a wide range of information retrieval tasks includ- 
ing ad hoc retrieval, web retrieval, cross-language retrieval, question-answering, 
interactive retrieval and spoken document retrieval. TREC is a global activity 
with around 100 groups now participating in one or more of the tasks. 

Since 2001 TREC has included the TRECVID track which aims to explore 
video data retrieval. Each year TRECVID makes an agreed set of data available 
to registered participants. A number of tasks are then performed on this data. 
These tasks have included: shot boundary detection, semantic feature extraction, 
news story segmentation, and interactive searching for relevant video associated 
with a set of predefined topics. The organizers and participants develop a set 
of perfect results or relevant stories, and all submissions are scored relative to 
these ideal results. 

TRECVID has confirmed video retrieval as a very challenging task, and
participants have worked collaboratively both to mark up the data for development
and testing and to share their work. Thus groups have shared the task of marking
up data with the features present in each shot, and have shared the output of their
automatic feature extraction tools to enable greater research exploration by the
community. Features explored in TRECVID so far include: outdoors/indoors,
face detection, people detection, cityscape, landscape, speech, instrument sound.
Search topics for the interactive evaluation include issues such as: “locomotive
approaching the viewer” and “microscopic views of living cells”.

The participation of the CDVP at DCU in TRECVID 2003 extends the prototype
Fischlar system described in the previous section. The group is participating
primarily in two tasks: the automated story boundary detection task
using a technique based on video analysis and support vector machines, and the
interactive searching task.

For the interactive task a group of test subjects were given a number of 
the search topics released by the TRECVID organisers at NIST. The test data 
for TRECVID 2003 was a set of news broadcasts from CNN and ABC. The 
objective is to locate as many relevant items as possible within a limited search 
time. Each broadcast was segmented by NIST into a standard set of shots with a 
keyframe for each shot. In order to support this task the interface outlined in the 
previous section was extended to include user marking of relevant shots retrieved 
in response to the initial search query. The text from this shot and information 
automatically taken from analysis of the keyframe for the shot were used in a 
relevance feedback adaptation of the initial search to expand the search query. 
The features taken from the keyframe were based on a combination of regional 
colour histogram analysis, average and maximum regional colour and regional 
edges. In the feedback search run the expanded text query was matched against 
shots and a matching score was also computed between the keyframe from each 
relevant shot and other keyframes in the collection. The scores generated from 
the text and image searches were then combined to form an overall score for
each shot. A revised retrieved list was then presented to the test user for further 
relevance judgement and, if needed, further iteration of the query. Full details 
of the DCU participation in TRECVID 2003 are contained in [16]. 
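
The text does not specify how the text and image scores were combined; the details are in [16]. As an illustrative sketch only, assuming min-max normalisation of each score list followed by a weighted sum (both choices, and the `text_weight` parameter, are assumptions made here), the fusion step could look like this:

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (value - lo) / (hi - lo) for shot, value in scores.items()}


def combine_scores(text_scores, image_scores, text_weight=0.5):
    """Rank shots by a weighted sum of normalised text and image scores.

    Shots missing from one of the two result lists contribute 0 for that part.
    """
    text_n, image_n = min_max(text_scores), min_max(image_scores)
    shots = set(text_n) | set(image_n)
    combined = {
        shot: text_weight * text_n.get(shot, 0.0)
        + (1.0 - text_weight) * image_n.get(shot, 0.0)
        for shot in shots
    }
    # Best overall score first; ties are broken by shot id.
    return sorted(combined, key=lambda shot: (-combined[shot], shot))
```

The returned list plays the role of the revised retrieved list that is shown to the test user for the next round of relevance judgements.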



9 Concluding Remarks 

Multimedia information retrieval is a challenging research area which despite 
recent progress will continue to present research challenges for many years. The 
complexity of the searching task both in terms of specification of information 
need and location of relevant content means that adaptation via relevance feed- 
back, personalization and collaborative filtering is potentially very useful for 
these tasks. Further, the proliferation of devices with the capability for multi- 
media playback and searching means that interfaces and search modalities must 
be adapted to take account of the differing interactivity constraints of these 
devices. While it may be possible to do this to some extent automatically, this
needs to be handled with care in order to maintain interface consistency between 
different platforms. 

Current prototype systems demonstrate that it is already possible to build 
large scale networked multimedia information retrieval systems. The quality and 
bandwidth of networked computing is set to increase further in the coming years. 
There are interesting challenges in terms of developing new multimedia information
management tools and interfaces appropriate for the many platforms that
are available. By far the most significant technical challenge at this point would
appear to lie with the automated extraction and labelling of features and their
use in the retrieval process.

Abstract. The ever-increasing amount of multimedia available online necessitates
the development of new techniques and methods that can overcome
the semantic gap problem. The said problem, encountered due
to major disparities between inherent representational characteristics of 
multimedia and its semantic content sought by the user, has been a 
prominent research direction addressed by a great number of semantic 
augmentation approaches originating from such areas as machine learn- 
ing, statistics, natural language processing, etc. In this paper, we review 
several of these recently developed techniques that bring together low- 
level representation of multimedia and its semantics in order to improve 
the efficiency of access and retrieval. We also present a distance-based 
discriminant analysis (DDA) method that defines the design of a basic 
building block classifier for distinguishing among a selected number of se- 
mantic categories. In addition to that, we demonstrate how a set of DDA 
classifiers can be grouped into a hierarchical ensemble for prediction of 
an arbitrary set of semantic classes. 



1 Introduction 

The vast increase of the amount of available multimedia content necessitates the 
development of new techniques and methods that not only are able to store and 
retrieve data effectively, but also can, with or without user’s assistance, over- 
come the semantic gap problem. The said problem is encountered due to major 
disparities between inherent representational characteristics of multimedia, such 
as color, texture, shape or motion descriptors, and its meaningful content sought 
by the user. Exacerbated by the issue of perception subjectivity, i.e. the change
of relevance judgments from one individual to another, the semantic gap problem
has been shown to adversely affect the performance of many multimedia
database retrieval systems [1]. Naturally, this area has been a prominent research
direction addressed by a great number of approaches originating from
such areas as machine learning, statistics, natural language processing, etc.

In the discussion that follows, we would like to identify such approaches as
those of semantic augmentation, since most of them are specifically focused on
bringing together low-level visual representation of multimedia and its semantics,
thus augmenting the information used by a multimedia database system in order
to improve the efficiency of access and retrieval. Our further consideration will
focus on two groups of methods for semantic augmentation: interactive (Section 2),
the adaptive approaches that are guided by relevance feedback supplied
by the user, and automatic (Section 3), those attempting to derive useful
correlations between representational characteristics of multimedia and its semantic
aspects by applying techniques that do not involve the user.

2 Interactive Semantic Augmentation 

The common trait of the methods belonging to the first group of techniques 
to be considered is the assumption of active user presence during the retrieval 
process. The user is regarded as the principal source of semantic knowledge in 
various possible scenarios of interaction. This knowledge may not be explicitly 
mapped onto a semantic concept (i.e., named) at the end of the process but 
we consider that the level of description or discrimination attained during the 
interactive process is high enough to be called semantic. 

In this section, we therefore review the most common ways of capturing and 
exploiting user interaction in view of enhancing the retrieval process. In most 
systems, user interaction is taken as relevance feedback [2] from a search result 
to a subsequent search step. In this scheme, the user is offered a search result 
and (s)he should mark (some) items of this set as being relevant or irrelevant
(possibly on a finer scale).

Primarily, this mechanism allows a direct computation of the importance of
individual items within the search context. This is exploited in Rocchio’s
formulation [4] for adapting term weights at search time (section 2.1). From a
different viewpoint, when considering that relevance feedback creates inter-item
relationships, one may then derive properties from their co-occurrence. This is
used in both association rule mining and collaborative filtering (section 2.2).
Alternatively, the interaction may be exploited in a Bayesian framework, where
relevance feedback becomes the basis for learning and classification
(section 2.3). These techniques are originally essentially blind to the type
of item under management. In section 2.4, we review some schemes that adapt
them to the context of interactive image retrieval. More specifically, various
tasks classically associated with CBIR are combined into an integrated framework
for a collaborative interactive semantic description of the data.

2.1 Rocchio’s Algorithm 

Fig. 1. Typical scenario of retrieval with relevance feedback

Many of the early methods for interactive semantic augmentation emerged from
the efforts proven effective in the field of document retrieval, and were built
according to the scheme illustrated in Figure 1. These approaches were closely tied
to the underlying vector space model [3], inheriting the weight calculation rules
based on the notions of term and inverse document frequencies, and processed
relevance judgments supplied by the user via an additive adjustment formula
known as Rocchio’s algorithm:

$$Q_{new} = \alpha Q_{old} + \beta \sum_{D_i \in \mathcal{R}} D_i - \gamma \sum_{D_i \in \mathcal{N}} D_i, \qquad (1)$$

where $Q_{old}$ is a query feature vector from the previous relevance feedback
iteration, the two sums run over the given documents $D_i$ of the relevant set
$\mathcal{R}$ and of the non-relevant set $\mathcal{N}$, respectively, and
$\alpha$, $\beta$ and $\gamma$ are tuning parameters. Using this strategy, integrated
with the vector space model for retrieval, the notion of similarity is interactively
adapted to the user profile by distorting components of the indexing space.

The Viper system [5] adapts classical document retrieval techniques to the
context of CBIR. It includes Rocchio’s algorithm as a way to handle feedback
and to account for the sparsity of positive examples against negative examples.
By tuning the $\beta$ and $\gamma$ parameters in Equation (1), this system is robust against
abundant negative examples that would normally make retrieval inconsistent [4].
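
The additive update of Rocchio's algorithm can be sketched in a few lines. The version below follows the common statement of the algorithm, in which the non-relevant documents are subtracted and no normalisation by set size is applied. The function name and the default parameter values are illustrative assumptions, not the settings used in Viper.

```python
def rocchio_update(q_old, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Return the adjusted query vector after one feedback iteration.

    q_old        -- previous query feature vector (list of floats)
    relevant     -- feature vectors of documents judged relevant
    non_relevant -- feature vectors of documents judged non-relevant
    """
    q_new = [alpha * component for component in q_old]
    for doc in relevant:
        for k, value in enumerate(doc):
            q_new[k] += beta * value      # move towards relevant documents
    for doc in non_relevant:
        for k, value in enumerate(doc):
            q_new[k] -= gamma * value     # move away from non-relevant ones
    return q_new
```

Choosing a small `gamma` relative to `beta` is one simple way of limiting the influence of abundant negative examples.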



2.2 Association Rules and Collaborative Filtering 

While the above feedback strategy is based on the features of the items within the 
collection, the principle here is to derive knowledge from the users themselves. 

Assuming the same system is accessed by several users, one wishes to predict
information in a given case based on a history of interaction. Let a dataset
(collection) D be composed of items, regrouped in itemsets. We wish to create
associations between itemsets X and Y ($X \cap Y = \emptyset$) within a particular
transaction T (e.g. a query process) of the form $X \Rightarrow Y$. That is, within a particular
context of interaction, we wish to state that whenever itemset X is considered,
itemset Y will also be considered. More formally, we wish to estimate the value
of

$$P(Y \subset T \mid X \subset T) \qquad (2)$$

Müller et al. [6] propose to use this technique to achieve long-term learning in
the context of Content-based Image Retrieval in the Viper system based on the
vector space model for retrieval. From usage logs, relevance feedback is exploited






Serhiy Kosinov and Stephane Marchand-Maillet 



to derive association rules between pairs of images marked relevant or otherwise.
Rather than acting on the documents themselves, the authors propose to
apply a long-term weight to the basic image features so as to set emphasis on
discriminant features.
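
The conditional probability of Equation (2) is what association rule mining calls the confidence of the rule. As a minimal sketch, assuming the usage log is available as a list of transactions, each a set of item identifiers (the data layout and the function name are assumptions made for illustration), it can be estimated by simple counting:

```python
def rule_confidence(transactions, x, y):
    """Estimate P(Y in T | X in T) from a list of transactions.

    transactions -- iterable of collections of item identifiers
    x, y         -- the two itemsets of the candidate rule X => Y
    """
    x, y = set(x), set(y)
    containing_x = [set(t) for t in transactions if x <= set(t)]
    if not containing_x:
        return 0.0  # the rule is never triggered in the log
    containing_both = sum(1 for t in containing_x if y <= t)
    return containing_both / len(containing_x)
```

For instance, with four logged query sessions of which three contain `img1` and two of those also contain `img2`, the estimated confidence of `{img1} => {img2}` is 2/3.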

The collaborative filtering approach [7] uses aggregated subjective evaluations
from a group of users to recommend items to an active user. Typically, from
a history of choices made by a population of users on a number of items, one
wishes to predict the choice of a user on a particular item. That is, one propagates
other users’ choices onto a particular user, based on known correlations between
that particular user and others who have already made a decision on that item. More
formally, let $v_{i,j}$ be the vote of user $u_i$ on item $o_j$ and $I_i$ the set of items on which
user $u_i$ has made a decision. Then, in the simplest case, $\bar{v}_i = \left(\sum_{j \in I_i} v_{i,j}\right) / |I_i|$
is the average vote of user $u_i$ on $I_i$, which correlates with the profile of user $u_i$.
Thus, the predicted feedback $v_{a,j}$ of active user $u_a$ on item $o_j \notin I_a$ is given by
the “profile” of user $u_a$ added with a weighted combination of personalised votes
on item $o_j$ ($v_{i,j}$ for all $u_i \neq u_a$):

$$v_{a,j} = \bar{v}_a + \kappa \sum_{i=1,\, i \neq a}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i), \qquad (3)$$

where the weight $w(a,i)$ represents the correlation between user $u_a$ and user $u_i$.
In early studies, this is simply taken as the Pearson correlation coefficient

$$w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \, \sum_j (v_{i,j} - \bar{v}_i)^2}}. \qquad (4)$$

User choices are accumulated. After showing a certain profile ($\bar{v}_a$) by interacting
with the system, user $u_a$ then receives recommendations for subsequent searches.
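
The prediction rule of Equations (3) and (4) can be illustrated with a short sketch. Two points are assumptions made here rather than statements from the text: the sums over items in the Pearson coefficient are restricted to items that both users have voted on, and the factor in front of the sum is taken as a normaliser that makes the absolute weights sum to one. Only users who have voted on the target item contribute to the prediction.

```python
from math import sqrt


def mean_vote(votes):
    """Average vote of one user over all items the user has voted on."""
    return sum(votes.values()) / len(votes)


def pearson(votes_a, votes_i):
    """Pearson correlation between two users over their common items."""
    common = set(votes_a) & set(votes_i)
    mean_a, mean_i = mean_vote(votes_a), mean_vote(votes_i)
    num = sum((votes_a[j] - mean_a) * (votes_i[j] - mean_i) for j in common)
    den = sqrt(sum((votes_a[j] - mean_a) ** 2 for j in common)
               * sum((votes_i[j] - mean_i) ** 2 for j in common))
    return num / den if den else 0.0


def predict_vote(active, item, all_votes):
    """Predict the vote of user `active` on `item` from the other users."""
    votes_a = all_votes[active]
    weights = {user: pearson(votes_a, votes)
               for user, votes in all_votes.items()
               if user != active and item in votes}
    norm = sum(abs(w) for w in weights.values())
    if norm == 0.0:
        return mean_vote(votes_a)  # no usable neighbour: fall back to profile
    kappa = 1.0 / norm
    adjustment = sum(w * (all_votes[user][item] - mean_vote(all_votes[user]))
                     for user, w in weights.items())
    return mean_vote(votes_a) + kappa * adjustment
```

Note that the item features never enter the computation: only the votes are used.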

It is important to note here that items are blindly considered as entities and 
that the complete recommendation procedure is done without any knowledge of 
the item features. The performance of the system is based solely on the quality
of the correlation computed and the consistency of the information propagated. 
In [8], this system is used to create a WebMuseum able to distinguish between 
styles of painting, simply by accumulating user relevance feedback. 



2.3 Bayesian Handling of Relevance Feedback 

Here, we still consider the classical relevance feedback protocol. Positive and
negative examples simply become the basis for a Bayesian classification. In
[9], positive and negative examples are treated separately. Positive examples are
successively used to estimate the parameters of a Gaussian distribution of their
features. Negative examples are used as centers of penalty functions so that
inferred results “stay away” from these examples (this process is referred to as a
‘dibbling process’ in [9]).

Similarly, Vasconcelos and Lippman [10] propose to use positive and negative
examples in a learning cycle by first evaluating a classification based on positive
examples $x_t$ and, based on the N best positive classes, solving an equation of the
type

$$S_t = \arg\max_i \left\{ \alpha \log \frac{P(x_t \mid S_i = 1)}{P(y_t \mid S_i = 1)} + (1 - \alpha) \log \frac{P(S_i = 1 \mid x_1, \ldots, x_{t-1})}{P(S_i = 1 \mid y_1, \ldots, y_{t-1})} \right\}, \qquad (5)$$

integrating $y_t$ as negative examples over time t. Equation (5) simply states that
the class best described by the set of positive and negative examples at hand
is that maximizing the posterior odds ratio between hypothesis “class i is the
target” and “class i is not the target”.

A different setup, yet using similar techniques, has been used in the classical
PicHunter browsing system of Cox et al. [11]. It is assumed that the user seeks a
specific target within the collection. (S)he is then successively proposed samples
of the collection containing the most probable targets inferred by the system using
a Bayesian posterior estimation.

Still in the spirit of learning from feedback, Tong and Chang [12] among 
others [13, 14] propose to use Support Vector Machines (SVM) for achieving 
concept active learning. 



2.4 Image Specific Framework 



Most of the previous techniques have been developed in the context of document
retrieval and may be applied to multimedia in general, provided the right
features are used. In the field of CBIR, a number of alternative uses of relevance
feedback have been proposed. Here, we do not just aim at creating adaptive
similarity associations between documents (i.e. feature vectors); we wish to
derive further useful properties of the images themselves.

For example, in [15, 16], Jing et al. propose a strategy for discovering region
importance in images handled by a CBIR system. The strategy is to pre-segment
the images using the classical JSEG algorithm [17] and then to compute
inter-region similarity. A region and an image are called similar if the image contains
a region similar to the region the image is compared with. Among the set of
positive examples, each region is weighted by a region frequency (RF) denoting
its consistency with other regions within the positive set. Then, based on a
scheme similar to the TF*IDF scheme for document retrieval, an Inverse Image
Frequency (IIF) is also computed as the importance of a given region within an
image (i.e. its ability to characterize the image). Finally, the region importance
(RI) of region i in a given image is computed as

$$RI_i = \frac{RF_i \cdot IIF_i}{\sum_{j=1}^{n} RF_j \cdot IIF_j}, \qquad (6)$$

where n is the number of regions within the image. This region importance is
finally accumulated in a linear scheme along the feedback steps.
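
The normalisation of Equation (6) and a possible linear accumulation over feedback steps can be sketched as follows. The RF and IIF values are taken as given inputs. The exact accumulation rule of [15, 16] is not stated in the text, so the convex combination with a hypothetical `rate` parameter used below is an assumption.

```python
def region_importance(rf, iif):
    """Normalised region importance for the n regions of one image.

    rf, iif -- lists with the region frequency and inverse image frequency
               of each region, in the same order
    """
    products = [r * f for r, f in zip(rf, iif)]
    total = sum(products)
    if total == 0:
        return [0.0] * len(products)
    return [p / total for p in products]


def accumulate(previous, current, rate=0.5):
    """Blend the importance from earlier feedback steps with the new one."""
    return [rate * c + (1.0 - rate) * p for p, c in zip(previous, current)]
```

Because the values returned by `region_importance` sum to one, a convex combination of two such vectors also sums to one.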

The fact of deriving region importance within images is an appealing process
since it forms a step towards identifying objects within the image. From there,
several processes may be facilitated. This is true for retrieval, which becomes
region-based retrieval, but also for segmentation and compression. In [18], this is discussed,
along with the presentation of a complete integrated framework for interactive
document retrieval and description. The main observation is that any interaction
with the data may be a valuable semantic input. The proposal is therefore
to make the data accessible in a number of ways including retrieval, description,
viewers and so on. An important aspect is that the data is accessed from
several points by several people. In the proposed system, the authors aim at
incrementally and collaboratively gathering and inferring high-level information
on the data immersed within this system. Eventually, content description may be
fixed into a knowledge base. It is shown that such an approach, tightly relating
content-based retrieval and content description, poses new solvable challenges,
as opposed to classical CBIR whose performance tends to saturate in current
systems.







Fig. 2. The functional schema of a CBIR system completed with possible acquisition of 
semantic knowledge. K, RJ and U mark places where a priori Knowledge, Relevance 
Judgments and User interaction may be inserted, respectively 



In the setup shown in Figure 2, user knowledge is captured at various locations
of an integrated framework. Techniques such as those described in the
previous sections may then be used to infer long-term semantic knowledge about
the data.

3 Automatic Semantic Augmentation

Similarly to the methods described in the previous section, many of the auto- 
matic semantic augmentation techniques have their origins in the areas of textual 
document retrieval and statistical natural language processing. 



3.1 Latent Semantic Indexing 

One of the most influential text-based approaches whose main principles are still
actively used in research to date is that of latent semantic indexing (LSI) [19].
Introduced as a means to tackle the problem of synonymy, i.e., the non-uniqueness
of sets of words (or terms) that can describe the same concept, the LSI method
assumes the existence of an underlying latent semantic structure in the textual data
that is partially obscured by the randomness of word choice. In order to recover such
latent structure, the method performs a truncated singular value decomposition of
the original term-document co-occurrence matrix (see Figure 3(a)), transforming
it into its reduced-rank approximation. Thus, the main idea behind this
transformation is to capture the major term-to-document association relations
while ignoring minor differences in terminology. Finally, a cross-language variation of
the LSI (CL-LSI), proposed by Landauer and Littman [20], that allows a query
in one language to retrieve documents in another language can be considered
a conceptual prototype of a whole family of automatic semantic augmentation
methods [21-24]. Indeed, as can be seen from a comparison of term-document
matrices for CL-LSI and Multimedia-LSI methods shown in Figures 3(b) and
3(c), one can easily replace the part that corresponds to the other language
keyword information with multimedia feature data extracted from images, videos,
etc. Thus, instead of retrieving documents in a language different from that
of a query, it should be possible to find multimedia “documents” whose visual
content corresponds to that of a query specified by keywords, and vice versa.
In other words, the same approach can be used to establish important associations
between visual feature representation of multimedia and its corresponding
semantics conveyed by the annotation keywords. This is exactly the issue
explored in detail by LSI-based automatic semantic augmentation methods such
as those of Zhao and Grosky [23,24] and others mentioned before.

Fig. 3. A comparison of vector space models in LSI-based methods
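
The truncated decomposition at the heart of LSI can be written out explicitly. The following is a sketch in standard notation; the symbols $A$, $U_k$, $\Sigma_k$, $V_k$, $k$, $q$ and $\hat{d}_j$ are introduced here for illustration and do not come from the text above. For the multimedia variant, it is assumed that the rows of $A$ stack keyword rows on top of visual feature rows, in the manner of Figure 3(c).

```latex
% Rank-k approximation of the m x n term-document matrix A,
% keeping only the k largest singular values:
A \;\approx\; A_k \;=\; U_k \Sigma_k V_k^{T},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}.

% A query vector q (in the original m-dimensional space) is folded
% into the k-dimensional latent space:
\hat{q} \;=\; \Sigma_k^{-1} U_k^{T} q .

% Document j is represented by the j-th row of V_k, written \hat{d}_j,
% and is ranked by cosine similarity to the folded-in query:
\operatorname{sim}(\hat{q}, \hat{d}_j) \;=\;
\frac{\hat{q}^{T} \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}.
```

With stacked keyword and visual rows, a query that has non-zero entries only in the keyword part of $q$ can still be compared with documents described mainly by visual features, since both are mapped into the same latent space.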

3.2 Cross-Language Modeling 

Fig. 4. An example of correspondence between image regions (blobs) and annotation
keywords sought by the translation model approach [25, 26] (panels: Images, Blobs,
Keywords)

An analogous idea of treating the visual feature data as another language to
translate keywords to and from is developed in a substantially different
state-of-the-art approach proposed by Barnard et al. [25,26]. According to the adopted
translation model, the authors consider the problem of object recognition as
one of machine translation. Given a representation in one form (image regions,
or blobs, derived by clustering segmented images), they attempt to turn it into
another form (keywords) using a developed model that acts in the capacity of a
lexicon. Thus, the pairs of images and their respective annotation keyword sets
are regarded as aligned bitexts, in which word-to-blob correspondence is to be
established (see Figure 4 for an example). Finally, the sought correspondence
is determined by optimizing the likelihood of word-to-blob association over all
possible assignments, expressed as:

N M n L n 

p{w\b) = X\^2,P(anj = i)t(w = w nj \b = b ni ), (7) 

n= 1 j = 1 i= 1 

where $N$ is the number of images, $M_n$ is the number of keywords associated
with the $n$-th image, $L_n$ is the number of blobs that the $n$-th image is
segmented into, $p(a_{nj} = i)$ is the probability that blob $b_{ni}$ is
associated with keyword $w_{nj}$, and $t(w = w_{nj} \mid b = b_{ni})$ is the
transition probability of word $w$ given blob $b$. This likelihood is
subsequently maximized via the EM algorithm [27]. A further development of
these ideas by Jeon et al. [28] led to the cross-media relevance model for
automatic image annotation and retrieval, while research with a greater focus
on the underlying generative probabilistic models, undertaken by Blei and
Jordan [29], produced correspondence latent Dirichlet allocation, a model that
finds conditional relationships between latent variable representations of sets
of image regions and sets of words. The latter method’s properties were
assessed through a comparison with two alternative hierarchical mixture models
of image data and associated text (a Gaussian-multinomial mixture model and
Gaussian-multinomial latent Dirichlet allocation), demonstrating its superior
performance on automatic image annotation, automatic image region annotation,
and text-based image retrieval.
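
To make the EM step behind (7) more concrete, the following is a minimal
sketch and not the implementation of [25,26]. It assumes uniform assignment
probabilities p(a_nj = i) = 1/L_n, so that only the table t(w | b) is
estimated, and it uses hypothetical blob identifiers and keywords as a toy
aligned corpus. The function name train_translation_table is ours.

```python
from collections import defaultdict

def train_translation_table(pairs, iterations=20):
    """pairs: list of (blobs, words) per image; returns t[(word, blob)] ~ t(w | b)."""
    words = {w for _, ws in pairs for w in ws}
    blobs = {b for bs, _ in pairs for b in bs}
    # Start from a uniform table: every word is equally likely under every blob.
    t = {(w, b): 1.0 / len(words) for w in words for b in blobs}
    for _ in range(iterations):
        count = defaultdict(float)   # expected number of (word, blob) links
        total = defaultdict(float)   # expected number of links per blob
        for bs, ws in pairs:
            for w in ws:
                # E-step: posterior that word w is explained by each blob of the image
                # (assumption: uniform assignment prior, so it cancels out).
                norm = sum(t[(w, b)] for b in bs)
                for b in bs:
                    post = t[(w, b)] / norm
                    count[(w, b)] += post
                    total[b] += post
        # M-step: renormalise the expected counts into t(w | b).
        t = {(w, b): (count[(w, b)] / total[b] if total[b] else 0.0)
             for w in words for b in blobs}
    return t

# Toy aligned "bitext" (hypothetical): blob cluster ids and annotation keywords.
pairs = [
    (["blob_sky", "blob_water"], ["sky", "water"]),
    (["blob_sky", "blob_grass"], ["sky", "grass"]),
    (["blob_water", "blob_grass"], ["water", "grass"]),
]
t = train_translation_table(pairs)
best = {b: max(("sky", "water", "grass"), key=lambda w: t[(w, b)])
        for b in ("blob_sky", "blob_water", "blob_grass")}
print(best)
```

Because each keyword co-occurs with its matching blob in two images and with
every other blob in only one, the posterior mass concentrates on the intended
word-to-blob pairs as the iterations proceed.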

A method proposed by Vinokourov et al. [30] explores a similar cross-language
paradigm for learning a semantic representation of web images and their
associated text. In contrast to the above-mentioned approaches [25,26,28,29],
these authors choose not to model the latent semantic aspects via generative
probabilistic schemes. Instead, they rely on a statistical technique, kernel
Canonical Correlation Analysis (KCCA) [31], originally developed for extracting
translation-invariant semantics from aligned corpora in English and French,
i.e., where every text $x_i \in X$ in one language has a corresponding
translation $y_i \in Y$ in the other language. The main hypothesis of this
technique is that, having the corpus $\{x_i\}_{i=1}^{N}$ mapped to some
high-dimensional feature space $\mathcal{F}_x$ as $\Phi(x_i)$ and the corpus
$\{y_i\}_{i=1}^{N}$ mapped to $\mathcal{F}_y$ as $\Phi(y_i)$, it is possible to
learn semantic directions $f_x \in \mathcal{F}_x$ and $f_y \in \mathcal{F}_y$
in those spaces so that the projections
$\{\langle f_x, \Phi(x_i)\rangle\}_{i=1}^{N}$ and
$\{\langle f_y, \Phi(y_i)\rangle\}_{i=1}^{N}$ of the original data in the two
languages are maximally correlated. This leads to a correlation coefficient
maximization problem, formulated as given in (8):



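\[
\rho = \max_{(f_x, f_y)}
\frac{\sum_i \langle f_x, \Phi(x_i)\rangle \langle f_y, \Phi(y_i)\rangle}
     {\sqrt{\sum_i \langle f_x, \Phi(x_i)\rangle^2 \, \sum_i \langle f_y, \Phi(y_i)\rangle^2}}
\qquad (8)
\]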



which, as the authors show, can be solved as a generalized eigenvalue problem. 
Of course, the same underlying formalism can be applied not only to extract 
translation-invariant semantics from aligned bilingual texts, but also to find 
correlations between, for instance, web images and their attached textual an- 
notation [30,32] and subsequently query an image database only by keywords, 
qualifying the technique also as an automatic semantic augmentation method. 
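
As an illustration of the quantity being maximized in (8), the sketch below
evaluates the correlation of the projections for two candidate pairs of
directions. It is only a sketch under simplifying assumptions: the feature map
is taken to be the identity (no kernel), the toy data are already centred, and
the maximization itself (the generalized eigenvalue problem) is not performed.
All data values and the function name correlation are hypothetical.

```python
import math

def correlation(fx, fy, xs, ys):
    """Correlation of the projections <fx, x_i> and <fy, y_i> (objective of (8))."""
    px = [sum(a * b for a, b in zip(fx, x)) for x in xs]
    py = [sum(a * b for a, b in zip(fy, y)) for y in ys]
    num = sum(p * q for p, q in zip(px, py))
    den = math.sqrt(sum(p * p for p in px) * sum(q * q for q in py))
    return num / den

# Hypothetical paired data: the first coordinate of each x_i and y_i carries the
# shared "semantic" signal, the second coordinate is unrelated noise.
xs = [(1.0, 0.3), (-2.0, -0.1), (0.5, 0.4), (-1.5, -0.2), (2.0, -0.4)]
ys = [(2.0, -0.5), (-4.0, 0.2), (1.0, 0.6), (-3.0, -0.1), (4.0, -0.2)]

aligned = correlation((1.0, 0.0), (1.0, 0.0), xs, ys)  # directions along the signal
noise = correlation((0.0, 1.0), (0.0, 1.0), xs, ys)    # directions along the noise
print(aligned, noise)
```

Directions aligned with the shared coordinate give a much higher correlation
than directions aligned with the noise coordinate, which is the property the
maximization in (8) searches for.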



3.3 Statistically Motivated Techniques 

Considering the automatic semantic augmentation approaches we have men- 
tioned so far, the powerful influence of the natural language processing and 
modeling perspectives is evident. However, there exist a number of methods 
derived from purely statistical premises, such as that of Mori et al. [33] based 
on dual clustering of visual and associated keyword information, also referred 
to as the co-occurrence model in some sources [25,26,28]. The authors propose 
to subdivide every image from the annotated collection into a number of
non-overlapping segments, each of which inherits the keywords associated with
its corresponding image. Then, the visual feature representations of these
segments are clustered by vector quantization and the estimates of likelihood of each keyword
for every cluster are derived by pooling the associated keyword frequencies. Hav- 
ing established such a clustering, the method then processes a query image with 
unknown annotation by subdividing it into segments and predicting its likely 
keywords from those of the clusters to which the query image segments are most 
similar. 
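
The co-occurrence idea described above can be sketched in a few lines. This is
an illustrative sketch and not the method of [33] as published: the codebook
centroids are assumed to be given (in practice they would come from vector
quantization of the segment features), the features are hypothetical
three-dimensional colour vectors, and the prediction simply averages the
keyword likelihoods of the nearest clusters.

```python
from collections import Counter, defaultdict

def nearest(centroids, v):
    """Index of the centroid closest to feature vector v (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(centroids[k], v)))

def fit_cooccurrence(images, centroids):
    """images: list of (segment_features, keywords). Returns P(keyword | cluster)."""
    counts = defaultdict(Counter)
    for segments, keywords in images:
        for seg in segments:
            # Every segment inherits all keywords of its image.
            counts[nearest(centroids, seg)].update(keywords)
    return {k: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for k, cnt in counts.items()}

def annotate(segments, centroids, model, top=2):
    """Average the keyword likelihoods of the clusters nearest to the query segments."""
    score = Counter()
    for seg in segments:
        for w, p in model.get(nearest(centroids, seg), {}).items():
            score[w] += p / len(segments)
    return [w for w, _ in score.most_common(top)]

# Hypothetical codebook: bluish, greenish and whitish colour clusters.
centroids = [(0.1, 0.1, 0.9), (0.1, 0.8, 0.1), (0.9, 0.9, 0.9)]
training = [
    ([(0.1, 0.2, 0.9), (0.2, 0.8, 0.1)], ["sky", "grass"]),
    ([(0.0, 0.1, 0.8), (0.9, 0.9, 1.0)], ["sky", "clouds"]),
    ([(0.1, 0.9, 0.2), (0.2, 0.7, 0.1)], ["grass"]),
]
model = fit_cooccurrence(training, centroids)
predicted = annotate([(0.1, 0.1, 1.0), (0.1, 0.9, 0.0)], centroids, model)
print(predicted)
```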

A two-dimensional multiresolution hidden Markov model (2D MHMM) [34] is at the
core of another statistical approach to automatic semantic augmentation
proposed by Li and Wang [35]. The authors’ system for automatic linguistic
indexing of pictures (ALIP) operates with a predefined number of semantic image
categories, specified by sets of keywords according to the problem domain. For
each category, the system profiles a 2D MHMM using as input the feature vectors
extracted from training images at multiple resolutions and arranged on a
pyramid grid. Once the training is complete, the set of the obtained 2D MHMMs
together with the keywords of their corresponding categories are stored in a
common dictionary of semantic concepts (see Figure 5). This dictionary can
subsequently be used to annotate new images, i.e., images not present in the
training sets, by selecting the keywords of the categories whose 2D MHMMs yield
the highest likelihoods computed from the features extracted from the images to
be annotated.

Fig. 5. Structural design of the ALIP system [35]



3.4 Distance-Based Discriminant Analysis

In our work [36], we also take a statistical vantage point and consider the
semantic augmentation task in the discriminant analysis framework [37-40],
where one seeks to distinguish between two or more predefined semantic classes
or categories of multimedia documents. Considering for the time being the
simple case with only two semantic classes, we look for a transformation of the
visual feature data that places instances of a given class near each other
while relocating the instances of the other class sufficiently far away in some
feature space. In other words, the sought transformation $T$ must enhance the
conformance of the data to the compactness hypothesis [41] with the aim of
improving the accuracy of nearest neighbor (NN) [42] classification. Formally,
such a transformation is found as a solution of an optimization problem where
the following criterion is to be minimized:

\[
\log J(T) = \frac{2}{N_X (N_X - 1)} \sum_{i<j}^{N_X} \log \Psi\bigl(d_{ij}^{W}(T)\bigr)
 - \frac{1}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \log d_{ij}^{B}(T), \qquad (9)
\]

where $N_X$ and $N_Y$ are the numbers of observations in data sets $X$ and $Y$
representing the two classes; $d_{ij}^{W}(T)$ denotes the Euclidean distance
between points $i$ and $j$ within data set $X$ transformed by $T$; analogously,
$d_{ij}^{B}(T)$ specifies the distance between the $i$-th point from
transformed data set $X$ and the $j$-th point from transformed data set $Y$;
and $\Psi(d_{ij}^{W}(T))$ is a Huber robust estimation function [43].
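
The following sketch evaluates a criterion of this kind for a given linear
transformation. It rests on several assumptions that go beyond the text:
criterion (9) is read as the mean log of Huber-transformed within-class
distances minus the mean log of between-class distances, the Huber threshold c
is a hypothetical parameter, T is a plain linear map, and the toy points are
invented. It does not perform the minimization discussed next.

```python
import math
from itertools import combinations

def huber(d, c=1.0):
    """Huber function: quadratic for small distances, linear beyond the threshold c."""
    return 0.5 * d * d if d <= c else c * d - 0.5 * c * c

def transform(points, T):
    """Apply a linear map T (given as a list of rows) to every point."""
    return [tuple(sum(t * x for t, x in zip(row, p)) for row in T) for p in points]

def log_criterion(X, Y, T, c=1.0):
    """Mean log Huber within-class distance in X minus mean log between-class distance."""
    tx, ty = transform(X, T), transform(Y, T)
    within = [math.log(huber(math.dist(a, b), c)) for a, b in combinations(tx, 2)]
    between = [math.log(math.dist(a, b)) for a in tx for b in ty]
    return sum(within) / len(within) - sum(between) / len(between)

# Hypothetical data: the classes differ along the first axis, the second is noise.
X = [(0.0, 0.0), (0.2, 3.0), (-0.1, -2.0)]
Y = [(2.0, 1.0), (2.2, -3.0), (1.9, 2.5)]
identity = [[1.0, 0.0], [0.0, 1.0]]
shrink_noise = [[1.0, 0.0], [0.0, 0.1]]   # compresses the uninformative axis

j_identity = log_criterion(X, Y, identity)
j_shrunk = log_criterion(X, Y, shrink_noise)
print(j_identity, j_shrunk)
```

A transformation that compresses the uninformative axis makes class X more
compact relative to its distance from Y, and therefore yields a lower value of
the criterion than the identity map.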

The problem of minimizing (9) is solved via iterative majorization [44], which
replaces the task of optimizing a complicated objective function by an
iterative sequence of simpler minimization problems in terms of members of a
family of special auxiliary functions. Given the properties of these auxiliary
functions [45,46], generally referred to as majorizing functions, the iterative
procedure generates a non-increasing sequence of objective function values,
converging to a stationary point which is a local minimizer under certain
constraints. For the chosen optimization criterion (9), we derive an
approximate majorizing function, up to a constant independent of $T$, as
expressed in (10):

m os j(T,T) = ^tr (T t X T RXT) + ^tr(T T Z T GZT) — 2f3tr(T T Z T GZT), (10) 

where $\bar{T}$ is the supporting point of the sought transformation $T$, i.e.,
its value at the current iteration, $R$ and $G$ are design matrices for
distance computations (see [36] for details), and $Z$ is the matrix that holds
data sets $X$ and $Y$ joined together. At every iteration, (10) is minimized
with respect to $T$, producing a result that becomes the supporting point of
the subsequent iteration, and so on, until convergence is reached. This process
of deriving a discriminative transformation, together with the presumed NN
classification, constitutes the essence of the distance-based discriminant
analysis (DDA) method, which can also be shown to apply to multiple-category
settings, as well as to dimensionality reduction. The experimental results for
a number of benchmark data sets and multimedia collections [47-49] show an
improvement in NN classification accuracy when the data transformation derived
by the DDA method is applied, as well as very competitive performance with
respect to many existing methods. The summary of object recognition results for
the ETH-80 database [47], where 80 images were used for training and 3200 for
testing, is given in Table 1, while an overview of the database is provided in
Figure 6.










Fig. 6. The 8 classes of objects of the ETH-80 database. Each class contains 10 objects 
with 41 views per object, for a total of 3280 images 



Table 1. Object recognition results for the ETH-80 image database

               Apple   Car     Cow     Cup     Dog     Horse   Pear    Tomato
% Error, NN    4.47    14.47   12.12   3.09    14.00   14.47   6.13    2.50
% Error, DDA   0.75    5.78    10.97   2.22    12.72   13.16   3.84    1.88



As the above results demonstrate, the developed DDA method can be used as a
basic building-block classifier for distinguishing among a selected number of
semantic categories. However, apart from the fact that this set of relevant
semantic categories must, as a rule, be manually engineered for every problem
setting, the design of such a set also has the drawback of being “flat”, i.e.,
the categories are assumed to be independent, non-overlapping and sufficient to
cover all of the problem domain (e.g., see Figure 5). This shortcoming, shared
by a number of statistically motivated methods, is, on the other hand, overcome
by approaches that rely on linguistic or probabilistic generative modeling to
capture inter-relations among different semantic aspects (recall, for instance,
the latent semantic structure or aspects from the above discussion). In our
work, this issue is addressed in the following manner. First of all, we
deliberately refrain from hand-picking suitable semantic categories and focus
instead on individual keywords as the basic semantic units of the annotated
multimedia data. For each unique pair of keywords, a similarity estimate is
computed based on their semantic relationships, e.g., hyponymy/hypernymy,
extracted from WordNet [50]. Then, the obtained pairwise similarity estimates
are processed by an agglomerative hierarchical clustering algorithm [51] in
order to build a semantic hierarchy of keywords, an example of which is
depicted in Figure 7. Finally, we use the derived hierarchy as a layout
blueprint and group a number of DDA classifiers in exactly the same fashion,
i.e., with a separate classifier for each branch or junction of the structure.
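
The clustering step can be sketched as follows. This is a minimal
illustration and not the procedure of [51] as used in our experiments: the
pairwise similarities are hypothetical hand-set numbers standing in for
WordNet-derived estimates, and average linkage is an assumed choice of merge
rule.

```python
from itertools import combinations

def agglomerate(items, sim):
    """Average-linkage agglomerative clustering; returns the merge tree as nested tuples."""
    clusters = {(w,): w for w in items}          # members -> subtree built so far

    def linkage(a, b):
        # Average pairwise similarity between the members of two clusters.
        return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two most similar clusters and record the merge in the tree.
        a, b = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters[a + b] = (clusters.pop(a), clusters.pop(b))
    return next(iter(clusters.values()))

# Hypothetical similarity estimates between keyword pairs (higher = closer).
raw = {
    ("hill", "mountain"): 0.9, ("ocean", "river"): 0.8,
    ("hill", "ocean"): 0.2, ("hill", "river"): 0.3,
    ("mountain", "ocean"): 0.2, ("mountain", "river"): 0.25,
}
sim = {frozenset(pair): value for pair, value in raw.items()}
tree = agglomerate(["hill", "mountain", "ocean", "river"], sim)
print(tree)
```

Each inner tuple of the resulting tree corresponds to a junction of the
hierarchy, i.e., to a place where one classifier would be attached.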

Fig. 7. An example hierarchy of semantic categories obtained by clustering

The resulting hierarchical ensemble of classifiers is functionally similar to a
mixture of experts model [52], where a classification decision at a given node
$\mathcal{N}$ depends on the classification outcomes in the parent nodes on the
path leading to $\mathcal{N}$. In this framework, the task of image
autoannotation is performed by starting from the root of the hierarchy (the
right side of the diagram in Figure 7) and traversing the structure until the
terminal nodes (keywords) are reached, while accumulating the results of
discriminant analysis performed at every node along the way. The keywords with
the highest accumulated likelihoods are then selected to annotate the new
image. From the examples of image autoannotation illustrated in Figure 8, one
may notice that the predicted keywords are semantically close to the true ones,
even when some of the keywords were not available for learning in the training
set, as is the case with the out-of-vocabulary word volcano, predicted as a
group including the semantically related keywords hill, mountain, rocks.
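
A possible form of this traversal is sketched below. The text does not specify
how the node-level outcomes are turned into likelihoods or how they are
accumulated, so the sketch makes its own assumptions: each classifier is
represented by a hypothetical probability for every child of its node, the
probabilities along a path are multiplied, and the product is normalized by the
number of traversed links (a geometric mean) so that groups and single keywords
can be ranked together.

```python
import math

def score_nodes(node, probs, logp=0.0, depth=0, out=None):
    """Walk the hierarchy from the root, accumulating log-probabilities along each path.

    node  : a keyword (str) or a tuple of child nodes
    probs : dict mapping (parent, child) -> classifier output for the query image
    """
    if out is None:
        out = {}
    if depth:
        # Geometric mean of the decisions on the path (assumed normalization).
        out[node] = math.exp(logp / depth)
    if isinstance(node, tuple):
        for child in node:
            score_nodes(child, probs, logp + math.log(probs[(node, child)]),
                        depth + 1, out)
    return out

# Hypothetical two-level hierarchy and classifier outputs for one query image.
land = ("hill", "mountain")
water = ("ocean", "river")
root = (land, water)
probs = {
    (root, land): 0.8, (root, water): 0.2,
    (land, "hill"): 0.6, (land, "mountain"): 0.4,
    (water, "ocean"): 0.7, (water, "river"): 0.3,
}
scores = score_nodes(root, probs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

With these invented numbers the top-ranked entry is the group (hill,
mountain) rather than a single keyword, which mirrors the keyword groups that
appear among the predictions in Figure 8.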




Query true keywords: 
water, corals, sky, clouds 

Predicted keywords/groups: 

1) 0.0457 corals, water 

2) 0.0409 grass 

3) 0.0405 beach, cliffs, hill, mountain, rocks 

4) 0.0383 ocean 

5) 0.0371 ocean, river, sky 




Query true keywords: 
volcano, sky, clouds, rocks 

Predicted keywords/groups: 

1) 0.0774 beach, cliffs, hill, mountain, rocks 

2) 0.0601 ocean, river, sky 

3) 0.0584 building, concrete 

4) 0.0398 building, concrete, house 

5) 0.0398 ocean 



Fig. 8. Image autoannotation examples 

An important characteristic of the described method is the arrangement of
classifiers in a semantic hierarchy, which, in a way, brings together the
statistical classification and linguistic modeling paradigms. Also, in contrast
to the other autoannotation methods we have discussed so far, a proper
normalization that takes into account the number of traversed links makes it
possible to predict keyword groups or more general semantic categories in
addition to individual keywords (see Figure 8). However, the approach is not as
versatile as other techniques [29,30] in the sense that it cannot directly be
used for keyword-based multimedia retrieval.



4 Summary

In this paper, we have reviewed a number of contributions representing a broad 
category of multimedia semantic augmentation approaches that aim at overcoming
the semantic gap problem in order to improve retrieval performance. Various
methods originating from fields as diverse as machine learning, natural
language processing, and statistics have been examined with a focus on the
particular aspects that make it possible to advance from basic computer-accessible
visual attributes of the data (e.g., color, texture, shape descriptors and other 
features) to the meaningful (implicit or explicit) characterization of the con- 
tent of the multimedia documents. According to whether or not user presence 
is necessary, the reviewed techniques have been arranged in two subcategories: 
interactive and automatic. 

Interactive techniques operate during a retrieval process. They are essen- 
tially based on the concept of relevance feedback where the basic user input is 
the degree of relevance of each proposed item with respect to a query. This
information may be used either to adapt the system's notion of similarity to
that of the user or to create inter-relationships between documents. In both cases, and provided
the feature space is rich enough, one may achieve high-level classification of the 
documents within the collection. 

Automatic methods, on the other hand, attempt to discover useful relations 
between representational characteristics of multimedia and its semantic aspects 
requiring no user involvement. Even though the relative degree of adaptability 
to subjective user judgements of such approaches may not exceed that of inter- 
active techniques, these methods, nevertheless, prove helpful when applied to 
semantically-guided search in multimedia databases, and advantageous in sit- 
uations when multimedia documents representing certain semantic notions are 
sought. These techniques generally rely strongly on classification, and we
proposed the DDA technique as a further contribution to the content
classification problem. Both interactive and automatic semantic augmentation
approaches have great potential to make multimedia retrieval more intelligent,
as has been demonstrated by a number of notable contributions described in this
paper.



Abstract. We report our work towards building user models of learners’
development based upon evidence of their interactions with an e-learning website
composed of multimedia learning objects. Essentially, this involves three ele- 
ments: effective metadata for each learning object; analysis of each user’s time 
spent with each learning object; reasoning about each individual’s knowledge 
of the course that is based upon a collection of learning objects. In this chapter, 
we focus on the problem of creating the metadata in a manner that will support 
effective user modelling. The chapter begins with a brief discussion of our 
overall architecture for building a user model and defining the metadata for 
multimedia learning objects. We then describe Metasaur, the interface used to
create the metadata, and SIV, the visualisation interface to support users in 
scrutinising the reasoning about them. We also briefly describe the process used 
to construct an ontology automatically from an existing dictionary. Importantly, 
we describe an extension that enables a person to create new metadata terms 
and link these elegantly into the ontology. We report a qualitative study in the 
use of the Metasaur interface for its two roles, creation of metadata and scrutiny 
of the user modelling processes. 



1 Introduction 

The holy grail of improved e-learning is the possibility of providing each learner with 
their own, personalised tutor. Evidence from studies of human one-to-one tutoring [1]
indicates the potential for huge gains in learning effectiveness, potentially taking the
student at the average level in the class up to the top of the class’s level of
performance. The core of such personalised teaching requires that the teacher maintains an
effective model of the learner and the means to use that to effectively personalise the 
teaching. The core of our work is to find ways to model learners more effectively. 

One particularly valuable source of information about the user of an e-learning 
system is the trace of their activity within it. There is already wide use of learning 
management systems such as WebCT [2] and Blackboard [3]. These collect extensive 
amounts of information about each learner’s activities within the system. Because this 
is generally in the form of simple web log style of information, it constitutes a huge 
volume of data that is rather difficult to interpret.

Even so, this type of information is clearly of great potential value in modelling
learners as a basis for improved teaching. This is reflected in the emerging interest in 
converting such data into forms that a teacher might use to improve the teaching. For 
example, one approach is to provide a visualisation that summarises the large amount 
of data in a useful form [4]. A similar approach involves applying data mining tools 
to build useful models of the class for teachers [5]. 

We are following a somewhat different approach. We are trying to build user mod- 
els of each student in the class. We take a very broad definition of a user model as a 
set of beliefs about the user [6]. For example, in a course on user interface design, we 
would like to build a model of how well the learner knows each component of the 
course. The design of such a model will be defined by the long term purpose for 
which the user model is needed. As we have already noted, one of the long term goals 
is to personalise teaching. We have taken a rather novel approach to defining our 
user models. It is based on the assumption that a teacher who is creating a per- 
sonalised teaching system may wish to exploit existing learning objects. We also 
assume that the teacher may regularly add new learning objects or remove existing 
ones. 

We have two main concerns that drive the design of our user models. Firstly, of 
course, the user model should be available to applications which need to customise 
their actions. So, for example, we would like to deliver additional direction and
guidance to students who are progressing poorly. Students who are progressing extremely
well might be offered extension materials. We also want to be able to allow students some
choice in the aspects they study, tempering that choice by the requirement that the 
student progress adequately on the basic material required by the teacher. 

Our second goal is to enable students to scrutinise their student models to see how 
they are doing. This means that the user model serves as a basis for supporting reflec- 
tion by the student. It has been argued [7, 8] that this can be important for improving 
learning. Students can then delve into the details of each component of the user model 
to scrutinise the processes used to create the model. In particular, the user can find out 
how the value for a component was arrived at and inspect the evidence sources which 
informed the system’s conclusion about that value. In this way, the learner interacts 
with an externalisation of the system's model of their knowledge. This also provides 
a means of self reflection. Students can inspect a component where they are getting a 
score that is not as good as they expect. They can see how the system concluded that 
their level of knowledge was poor. They can reflect and consider whether they agree 
with the system’s conclusions. If they do not agree, they can add evidence indicating 
that they believe they know that component. If they see that the system’s evidence 
does indicate weakness, they can start planning to improve it. 

We have been exploring interfaces that can support users in scrutinising large user 
models. In this paper, we will describe two recent innovations in that work. The first 
makes use of an ontology to structure the user model into a graph whose structure 
should give flexible support for users scrutinising related parts of the user model. We 
describe the way that we construct this ontology in a manner that is consistent with 
our goal of scrutability. In the case of the ontology, we want users to be able to
scrutinise the structure of the ontology and the processes used to form it. This means
the user should be able to determine why the system considers that two terms A and B
are related and how. So, for example, the user should be able to determine why the
system considers that A is a specialisation of B. Our very simple approach to this has
been to build a lightweight ontology automatically by analysing specialized
dictionaries. Then we can use the dictionary definitions as the basis for explaining any
element of the ontology. Since the dictionaries were, by definition, written explicitly for
the purpose of explaining the meanings of terms in a form suited to humans, they 
have the potential to be an excellent basis for supporting scrutability as we wish to 
do. 
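
To illustrate why dictionary definitions lend themselves to scrutable
explanations, the sketch below links each term to the other dictionary terms
that occur in its definition and returns that definition as the justification
of a link. It is a deliberately naive illustration under our own assumptions:
the glossary entries are hypothetical, the helper names build_links and explain
are ours, and the matching rule is a plain whole-phrase search; it should not
be read as the algorithm used by our ontology construction tool.

```python
import re

def build_links(dictionary):
    """Link each term to every other dictionary term mentioned in its definition."""
    links = {}
    for term, definition in dictionary.items():
        text = definition.lower()
        links[term] = sorted(
            other for other in dictionary
            if other != term
            and re.search(r"\b" + re.escape(other.lower()) + r"\b", text)
        )
    return links

def explain(dictionary, links, a, b):
    """Return the definition that justifies a link, so the user can scrutinise it."""
    if b in links.get(a, []):
        return f"'{a}' is related to '{b}' because '{a}' is defined as: {dictionary[a]}"
    return f"No direct link from '{a}' to '{b}'."

dictionary = {  # hypothetical glossary entries
    "usability": "The ease with which people can use an interface.",
    "cognitive walkthrough": "An inspection method for assessing the usability of an interface.",
    "interface": "The point of interaction between a user and a system.",
}
links = build_links(dictionary)
print(explain(dictionary, links, "cognitive walkthrough", "usability"))
```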

There are several challenges we have tackled in the building of the user models for 
our course. Firstly, the multimedia learning materials need to be marked up with 
associated metadata. This metadata is critically important: as the learner interacts with 
learning objects, we want to use the associated metadata for those learning objects to 
provide a source of evidence for user modelling. For example, suppose a student 
attends an on-line lecture that has metadata indicating that it teaches about usability. 
If the student completes the lecture properly, this provides some evidence that the 
student has learnt something about usability. This means that the metadata terms need 
to match the components in the user model. 

We recently realised that our user model visualisation tool has the potential to be 
helpful for the markup of metadata on learning objects. In this case, we do not use it 
in the usual way, to gain an overview of an individual’s learning progress. Instead, 
the user model terms and their associated ontological structure assist the teacher who 
wants to associate appropriate metadata terms with learning objects. Note that this 
essentially means that the defined user model’s vocabulary serves as the metadata 
term set. In addition, we exploit the ontology-based visualisation to help the teacher
consider both the terms that they identified as appropriate for defining the learning
content and a display of closely related terms that might also
be used. This is particularly useful in new disciplines where terms are evolving. Note
that an alternative approach would be to use the ontology as a basis for inferring addi- 
tional metadata terms automatically. This is akin to the information retrieval work 
that makes use of thesauri [9]. We are currently exploring the alternate approach 
where the teacher who is making use of a learning object explicitly chooses the meta- 
data. This has the merit of ensuring that the teacher feels in control. Moreover, the 
teacher can apply sophisticated background knowledge to their decision based on 
many factors that are well beyond the reasoning possible from an ontology alone. For 
example, the teacher may take into account factors such as the aspects they plan to 
assess in final examinations, what is taught in prerequisite or concurrent courses that 
students undertake as well as the demands of subsequent courses. 

In this chapter, we describe our exploration of ways to repurpose our user model 
visualization tool to support creation of metadata on learning objects so that student 
interactions with these can serve as a basis for creating evidence about a user’s devel- 
oping knowledge. We begin with an overview of the context and architecture of this 
system for building user models from accesses to learning objects marked up with 
metadata. Then we describe the type of learning object that we have been working 
with and the nature of the task of defining metadata for such learning objects. We
then introduce Metasaur, the interface that supports creation of metadata. In order to
describe it, we introduce its elements. One of the most important is the user model 
visualisation tool, SIV (Scrutable Inference Viewer). Another critical element is 
Mecureo, the tool that constructs the ontology used by SIV. The last element of the 
description is the mechanism we have created for enhancing the ontology. This is a 
critical innovation since we want to be able to make use of existing dictionaries but, 
at the same time, we want to allow teachers to add new user model component terms 
(and corresponding metadata terms) if they feel the need. As we will show, the way 
that Mecureo operates means that we have been able to use an extremely simple and 
elegant but effective approach to this task. Throughout this work, we have performed 
a series of evaluations of the interfaces and approaches. We summarise these and 
illustrate the effectiveness of the current approach. 



2 Context and Architecture 

The tools and techniques in this chapter are described in the context of a course in 
User Interface Design and Programming. It is taught through a combination of online 
material and face to face tutorials. The course has a website that is customised for 
each student by presenting material such as pre-recorded lectures and laboratory ex- 
ercises relevant to their enrolled course (regular, advanced and postgraduate). 

There are 20 online lectures that students are expected to attend at times they can 
choose, but partly dictated by the assignment deadlines. There are recommended 
deadlines for attending each lecture. Each lecture has around 20 slides, and each slide 
has associated audio by the author. Generally this audio provides the bulk of the in- 
formation, with the usual slide providing a framework and some elements of the lec- 
ture. This parallels the way many lecturers use overhead slides for live lectures. An 
example of a slide is shown in Figure 1. This figure shows a slide that is typical in 
that it has a quite brief summary of the aspects that are discussed in the audio. In this 
particular case, the audio lasts for 186 seconds. Some of the shortest slides have audio 
lasting just 10 seconds while some of the longest are around 600 seconds. 

Students should not only view the slides but also listen to the audio (and make 
their own notes) in order to gain an understanding of the concepts discussed. A user 
profile keeps track of student marks and lecture progress. These profiles along with 
the web access logs can provide user data required to have a good foundation for 
creating rich user models. 

Learning objects such as our online lectures are typical of the large range of online 
resources which allow students to study the course content at any computer they wish 
to use and at any time. Unlike a conventional face-to-face lecture, the student can 
stop the lecture at any time, replaying an element that they did not quite follow. Stu- 
dents can return to the lecture or a particular slide in it, as they need, for example, 
when tackling an assignment, working on a tutorial task or preparing for final exami- 
nations. As more learning resources become available on-line, the log data about their 
accesses becomes a rich source of user modelling information. Each access can be 
regarded as a form of evidence that the student has progressed in their learning. Such
evidence may not necessarily be highly reliable. In fact, it is prone to many
weaknesses. For example, one student may masquerade as another. A student may access a
page but not actually learn anything from it. However, such evidence is readily
available. It is very attractive as a source of user modelling evidence: it is cheap,
plentiful and does not require additional input from the user, even if it must be
interpreted with some caution. To exploit it, we need to add metadata to the learning
objects. Figure 2 shows the overall architecture of our system for building user models
from such data.




Fig. 1. The User Interface Design and Programming course website. This screenshot shows one
of the audio slides from the Cognitive Walkthrough lecture. 

At the top, we show the teacher who is responsible for the course being undertaken 
by the student shown at the bottom. The teacher creates the metadata for each learn- 
ing object by interacting with our ontology-enhanced interface, Metasaur. As each 
student interacts with a learning object, this is recorded in a log, shown at left. Our 
system analyses this log and the metadata for each learning object to create evidence 
that is added to the student’s user model. This model can then be used to drive
personalisation of the teaching and, as shown in the figure, the student can scrutinise
their own model using SIV.

To actually construct the user model, we use the UIDP-UM toolkit, consisting of 
several scripts written in Python. This processes the User Interface course website 
access logs and the metadata created by Metasaur. User models are generated and 
stored in Personis [10]. A resolver compares the evidence in each component of the 
user model with data for that component for the Perfect User. The perfect user has
attended all of the slides in the course once, for the exact length of the audio. Each
component in the user model is given a score based on how closely the learner’s
listening time matches that of the perfect user. The final user model data is serialized
to XML. SIV then overlays the initial Mecureo-generated ontology with user data. 
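
The comparison with the perfect user can be sketched as follows. This is an
illustrative sketch only: the function name, the slide identifiers, the idea of
capping credit at the audio length and the use of a simple ratio as the score
are our assumptions for the example, and the code does not reproduce the
UIDP-UM scripts or the Personis interface.

```python
def component_scores(listened, audio_length, metadata):
    """Score each user-model component against the 'perfect user'.

    listened     : slide id -> seconds this learner spent on the slide (from the logs)
    audio_length : slide id -> length of the slide's audio in seconds
    metadata     : slide id -> list of metadata terms (user-model components)
    """
    actual, perfect = {}, {}
    for slide, terms in metadata.items():
        # The perfect user hears each slide exactly once, so credit is capped
        # at the audio length (an assumption of this sketch).
        heard = min(listened.get(slide, 0), audio_length[slide])
        for term in terms:
            actual[term] = actual.get(term, 0) + heard
            perfect[term] = perfect.get(term, 0) + audio_length[slide]
    return {term: actual[term] / perfect[term] for term in perfect}

# Hypothetical slides, metadata terms and listening times.
audio_length = {"cw-01": 186, "cw-02": 10, "heur-01": 600}
metadata = {
    "cw-01": ["cognitive walkthrough", "usability"],
    "cw-02": ["cognitive walkthrough"],
    "heur-01": ["usability"],
}
listened = {"cw-01": 186, "cw-02": 400, "heur-01": 150}
scores = component_scores(listened, audio_length, metadata)
print(scores)
```

In this example the learner matches the perfect user on the component
"cognitive walkthrough" but has heard less than half of the audio attached to
"usability", so the latter component receives the lower score.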




Fig. 2. Overview of architecture of system for building user models from logs of accesses to 
metadata for learning objects. 



3 The Problems of Creating Metadata 
and the Metasaur Architecture 

The task of annotating existing documents with metadata is challenging and non- 
trivial because it is hard to be thorough and consistent, and the task is both demand- 
ing and boring, especially in systems with many existing documents and a large meta- 
data term vocabulary [11]. 

A reflection of the recognition of the importance and difficulty of this problem of 
metadata markup is the growing number of tools which are exploring ways to support 
the task. For example, one such tool, Annotea [12], builds on Resource Description 
Framework (RDF) technologies, providing a framework to allow users to add and retrieve 
a set of annotations for a web object from an “annotation server”. 

Since it is such a tedious task to add the metadata by hand, there is considerable 
appeal in finding ways to automate part of the process. Even in this case, there is 
likely to be a need for human checking and enhancing of the metadata. It should be 
possible to build interfaces that can support both the checking of metadata which was 
created automatically as well as hand-crafting of metadata. 

Ontologies have the potential to ease the task of creating metadata by providing a 
common vocabulary of terms and relationships for a domain. They also play an im- 
portant role in the vision of the Semantic Web [13]. It makes sense to exploit the 
technologies and standards developed as part of the Semantic Web initiative. As the 
content on the WWW grows, so too does the number of tools, standards and technologies 
available to us for managing WWW content. The Web Ontology Language 
(OWL) [14] is one such standard, which aims to provide a standard representation 
for ontologies in the Semantic Web. 

In spite of their potential, there are problems in exploiting ontologies. One is that 
ontologies are generally time consuming to construct [15]. It is, therefore, appealing 
to find ways to create ontologies automatically. The OntoExtract tool described in 
[16] is an example of one such system. 

Another challenge in exploiting ontologies relates to issues of interfaces. If we are 
to exploit ontologies as an aid in the metadata markup, we need to provide intuitive 
and effective interfaces to the ontology. These are critical in supporting users to navi- 
gate the ontology to find the terms they want to use as metadata and then to easily see 
the closely related ones that may also deserve consideration as potential metadata 
candidate terms. The value of good interfaces to ontologies is reflected in the range of 
novel ontology visualisation tools such as Ontorama [17], Bubbleworld [18] and the 
graph drawing system by Golbeck and Mutton described in [19]. 

We now give an overview of the structure of Metasaur, our ontology-enhanced in- 
terface at the upper part of Figure 2. This is shown in Figure 3 and is described 
briefly below to give a context to the detailed descriptions that follow for the core 
elements. 

At the top, we show the existing dictionary source. In the current work, we have 
been using an online glossary of HCI terms [20] which reflect the material taught in 
the course. As shown at the top, Mecureo automatically generates the ontologies in 
the OWL format as we will discuss. This is imported into the main Metasaur interface, 
which is shown as a combination of the SIV visualisation of the ontology and the display 
of the learning object. The teacher who is creating the metadata interacts with 
this interface. The figure shows the metadata produced at the lower right. 

The Metasaur architecture as described to this point was used in our first imple- 
mentation of Metasaur. As we will describe, our evaluations indicated that our collec- 
tion of learning objects and the courses in which they were used involved many terms 
that were not in the Usability First glossary that was the foundation of our ontology. 
Upon reflection, we decided that this is likely to be a significant problem in general. 
We are convinced that it should be possible to build an excellent foundation ontology 
from a specialised existing dictionary-like resource, such as Usability First (and 
FOLDOC, the Free Online Dictionary of Computing [21], used in our earlier work). 
However, in a fast changing area like computing, we cannot expect that they will 
have all the terms of importance for marking up learning objects. Perhaps more im- 
portantly, technical and specialized dictionaries often omit core and common terms 
that are widely understood. Yet these will often be important for modeling student 
knowledge. 

For example, one of the important concepts in our course is the needs of novice 
users and the usability techniques that are applicable to them. The term, novice, does 
not appear in the Usability First glossary. Since the teacher of the course wanted to 
model student knowledge of issues associated with novice users, we needed to determine 
a graceful means to enhance the ontology. 




Fig. 3. Overview of the Metasaur architecture. 

This is discussed in [22], which addresses the problem by doing additional analysis 
of the source texts they use to construct the ontology. Teachers can simply create 
local dictionary definitions, as shown at the left of Figure 3. They then need to rerun 
the ontology building process to incorporate these into the ontology generation proc- 
ess. We will describe this in detail later. 



4 Extended Description of Metasaur 

The interface provides an interactive visualisation to an ontology as an aid to the user 
who is either creating metadata by hand or verifying and enhancing metadata that has 
been automatically created. There are existing systems that allow instructors to add 
metadata to learning objects [23] as well as standards for metadata about Learning 
Objects [5]. These systems employ an extensive description of the domain that is 
usually defined by the course instructors. Our approach is different in that we use a 
lightweight ontology [24] that is automatically constructed from an existing data 
source. It also provides a novel visualisation of the ontology that supports an exploratory 
approach to discovering appropriate metadata terms in the domain. 

There are two main parts to the Metasaur interface. The left contains the visualisation 
provided by SIV (the Scrutable Inference Viewer). The learning object contents 
and visualisation control widgets are on the right. Learning objects are currently made 
up of a series of slides. The content of each slide currently consists of the slide itself, 
an audio object associated with the slide, and questions related to the concepts. The 
list of existing learning objects is available as a dropdown, shown in Fig. 4, area 1. 
The current slide being viewed is in area 2. We can see that in Fig. 4 the slide titled 
Where GOMS Fits has the metadata terms GOMS and user intuition associated with it 
already. 




Fig. 4. The Metasaur interface showing the SIV interface on the left, and the slide with associated 
metadata on the right. The numbered areas are as follows: 1. dropdown for selecting different 
learning objects; 2. the contents of the current slide for this learning object; 3. interface for 
search queries on the ontology; 4. the existing metadata for this slide; 5. tools to allow users to 
define their own metadata terms; 6. the SIV interface for visualising the domain ontology. 
Note: this image and subsequent ones have been rendered in black and white for publication 
clarity. 

The bottom right components (Fig. 4, areas 4 and 5) relate to the metadata for the 
learning object. Area 4 displays a list of the current terms associated with the object 
(with links back to the visualisation). Area 5 provides access to an interface so users 
can define terms that are not in the vocabulary. Users can specify a new term along 
with related terms, a category and a text description; these new terms can then be 
integrated with the ontology. Users can also select from their terms that have not yet 
been integrated into the ontology from the drop down shown in the top portion of 
area 5. In the figure, this shows the term active voice. 




Fig. 5. User selects the term heuristic evaluation on the visualisation and clicks on Add Meta- 
data Element to add the term to the slide. Note that Heuristic Evaluation is the focus of SIV 
and so is larger than the other terms and has more space around it. 

The SIV interface on the left (Fig. 4, area 6) allows users to explore the domain 
ontology. It is an evolution of VlUM (for Visualisation of Larger User Models), a 
tool that can effectively display large user models in web-based systems. The VlUM 
interface has been extensively tested with user models consisting of up to 700 concepts. 
Users have been able to navigate around the user model and gain an overview 
of the concepts inside it [25]. The interface has been modified to allow us to 
visualise ontologies. 

Users can navigate through the ontology by clicking on a concept to select it. The 
display changes so that the newly selected concept becomes the focus. A slider allows 
users to limit the spanning tree algorithm to the selected depth. In Fig. 5, for example, 
the main focus is heuristic evaluation, and some secondary terms are consistency and 
time. The depth is currently set at 1. Changing the depth will change the number of 
terms blurred out or unblurred on the visualisation. 




Fig. 6. The left panel shows the SIV visualisation from Fig. 5. The user has heuristic evaluation 
in focus. The related term usability inspection can be seen at the lower part of the display. 
The user clicks on usability inspection to bring it into focus. The visualisation changes so it is 
in focus - now the related terms have changed to be those of the new in-focus term as shown on 
the right. 



As an example of adding metadata to a learning object, a user Tanya wishes to annotate 
the slide shown in Fig. 4. She reads the bullet point “Heuristic evaluation” and 
selects the text evaluation (which becomes highlighted as in Fig. 4), then clicks on 
the Search Selected button (in area 3, Fig. 4) to perform the search. Results are 
shown on the visualisation on the left hand side of the interface. Tanya now scans 
through the search results, and notices the term heuristic evaluation. She selects this 
term in the visualisation, and clicks on the Add Metadata Element button (in area 4, 
Fig. 4). A popup asks for confirmation before the term is associated with the slide 
(Fig. 5). The term is added to the slide, and now Tanya scans the other visible terms 
on the visualisation to see what else might be appropriate to describe the slide 
(Fig. 6). She sees the term usability inspection and selects it. The term is then added 
the same way as the previous one. 



4.1 Mecureo 

Mecureo automatically constructs ontologies from existing dictionary or glossary 
sources [26]. It has been used to generate the ontologies for the SIV interface. An 
important motivation for this approach is that we want to ensure user control and 
system transparency. By building the ontology from a dictionary, we can always 
provide users with a human understandable explanation of any part of the ontology 
by presenting relevant dictionary definitions. Mecureo was developed in the context 
of the Free Online Dictionary of Computing (FOLDOC) [21] but can parse any semi- 
structured dictionary or glossary source. 

The word and definition tuples in the dictionary are parsed to create a digraph of 
keywords linked to related keywords that appear in the definition. Grammatical con- 
ventions in the definition are used to add typing to the links. For example, the defini- 
tion for declarative language contains the phrase “A general term for a relational 
language or a functional language, as opposed to an imperative language”. An anto- 
nym relationship is created between the concepts declarative language and impera- 
tive language in the digraph. The relationships are also weighted. Broadly speaking, 
these weightings are interpolated from the position of the words in the definition. 
Lower weightings are given to words that appear earlier in the definition, to represent a 
stronger link. 
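
The parsing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mecureo's actual code: the function name build_graph, the linear weight formula (0.1 for a match at the very start of a definition, rising towards 1.0 at the end) and whole-phrase matching are all hypothetical choices, and the typing of links (for example antonym) is left out.

```python
import re
from collections import defaultdict


def build_graph(dictionary):
    """Build {term: {related_term: weight}} from {term: definition}.

    Assumption: weights are interpolated linearly from the character
    position of the first match, so earlier matches get lower weights
    (stronger links). The real interpolation is not specified here.
    """
    graph = defaultdict(dict)
    for term, definition in dictionary.items():
        text = definition.lower()
        length = max(len(text), 1)
        for other in dictionary:
            if other == term:
                continue
            # Match the other headword as a whole phrase in this definition.
            match = re.search(r"\b" + re.escape(other.lower()) + r"\b", text)
            if match:
                graph[term][other] = round(0.1 + 0.9 * match.start() / length, 3)
    return dict(graph)


glossary = {
    "declarative language": "A general term for a relational language or a "
                            "functional language, as opposed to an imperative language.",
    "relational language": "A language based on relations.",
    "functional language": "A language based on functions.",
    "imperative language": "A language based on commands.",
}
links = build_graph(glossary)["declarative language"]
print(sorted(links, key=links.get))  # strongest link first
```

Only the first definition is taken from the text; the three short definitions are placeholders invented so that the example runs.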

A point query can be executed from a term in the ontology. A point query involves 
selecting a node (a point) in the graph, and a distance to expand out by. A spanning 
tree from this point is created, up to the specified distance. Mecureo also provides 
facilities to merge point queries into single subgraphs. Point queries are important as 
they allow us to extract a small, human manageable portion of the ontology as a separate 
ontology in itself. In the context of FOLDOC, there were over 14000 terms. We 
found in our evaluations that point queries consisting of around 300-500 terms were 
quite easy to work with and navigate in SIV (the evaluations on the original VlUM 
indicated it was effective with around 500 concepts). 
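
A point query can be pictured as a bounded expansion over the weighted digraph. The sketch below is an assumption about how such a query might work, not a description of Mecureo's implementation: it uses a Dijkstra-style shortest-path expansion (our choice; the text only says that a spanning tree is built up to a distance), and the name point_query and the sample weights are hypothetical.

```python
import heapq


def point_query(graph, start, max_distance):
    """Return (distances, parents) for all terms within max_distance of start.

    graph is {term: {neighbour: weight}}; lower weights mean stronger links.
    parents describes the spanning tree rooted at start.
    """
    best = {start: 0.0}
    parent = {start: None}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            new_dist = dist + weight
            # Stop expanding once the summed weights exceed the cut-off.
            if new_dist <= max_distance and new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                parent[neighbour] = node
                heapq.heappush(queue, (new_dist, neighbour))
    return best, parent


sample = {
    "heuristic evaluation": {"usability inspection": 0.3, "consistency": 0.6},
    "usability inspection": {"cognitive walkthrough": 0.4},
    "consistency": {"standards": 0.9},
}
distances, tree = point_query(sample, "heuristic evaluation", 1.0)
print(sorted(distances))
```

Raising max_distance pulls in more distant terms, which is one way to arrive at subgraphs of a size that is comfortable to navigate.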

An important feature of Mecureo is that it allows a user to scrutinise the relation- 
ships between the words in the graph. It does this by linking each term back to the 
original dictionary definition. The definition should provide an excellent basis for 
explaining to a user why a link exists. For example, given a relationship between 
terms A and B, we can simply present the dictionary definitions that contain both A 
and B. In each, we can highlight the occurrences of the other term in the definition 
body. 
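
The explanation step just described might look like the following sketch. It assumes the dictionary is held as a simple mapping from term to definition text; the function name explain_link, the asterisk highlighting and the two sample definitions are invented for illustration and are not taken from the Usability Glossary.

```python
import re


def explain_link(dictionary, term_a, term_b, mark="*"):
    """Return (headword, highlighted definition) pairs justifying a link.

    Checks the definition of each term for occurrences of the other term
    and wraps every occurrence in `mark`.
    """
    explanations = []
    for term, other in ((term_a, term_b), (term_b, term_a)):
        definition = dictionary.get(term, "")
        pattern = re.compile(re.escape(other), re.IGNORECASE)
        if pattern.search(definition):
            highlighted = pattern.sub(lambda m: mark + m.group(0) + mark, definition)
            explanations.append((term, highlighted))
    return explanations


glossary = {
    "hunt and peck": "A typing style common among novice users.",  # placeholder text
    "novice": "A user who is new to a system.",                    # placeholder text
}
for headword, text in explain_link(glossary, "novice", "hunt and peck"):
    print(headword + ": " + text)
```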

For this metadata annotation system, Mecureo has been used to generate an ontology 
from the Usability Glossary [20]. The process results in an ontology in the HCI 
domain consisting of 1127 terms and 5876 directed relationships between them. 



4.2 Ontology Structure 

The ontology created by Mecureo is serialized as an OWL file. Mecureo uses its own 
set of concept and relationship definitions defined in OWL. Each concept is then 
represented in OWL as shown in Figure 7, which gives the representation of the concept 
novice. All the links are directed, so Mecureo creates reverse relationships in the 
linked terms. For example, the concept Fitts’ Law has a relationship hasParent to the 
concept input device, which in turn has a hasChild relationship back to Fitts’ Law. 
The OWL header for the Mecureo-generated ontology contains a number of proper- 
ties that represent the relationships that are used in the parsing process. The hasPar- 
ent and hasChild properties can be seen below. These contain a resource attribute that 
is the URL describing the related term (the additional encoding such as the %20 is 
present to make the term names legal URL suffixes). For example, novice has a rela- 
tionship to the concept hunt and peck. The relationship type is hasChild. So hunt and 
peck is a child of novice. 



<Concept rdf:ID="novice">
  <dc:Title>novice</dc:Title>
  <hasParent rdf:resource="#use%20what%20you%20build"/>
  <hasParent rdf:resource="#natural%20language%20search%20query"/>
  <hasParent rdf:resource="#selection%20bias"/>
  <hasChild rdf:resource="#shortcuts"/>
  <hasChild rdf:resource="#hunt%20and%20peck"/>
  <hasChild rdf:resource="#training%20wheels%20interface"/>
  <hasChild rdf:resource="#idiot%2dproof"/>
  <hasChild rdf:resource="#side%20effects"/>
</Concept>



Fig. 7. The OWL representation for the term Novice showing relationships to other terms. In 
this case there are a number of parent and child relationships. 



Serializing the ontology in OWL format makes the output accessible to 
other ontology tools, for example the ontology editor Protégé [27]. The SIV visualisation 
only requires the URI of the OWL file to be able to visualise it, and could potentially 
use any arbitrary OWL file as input. This means that the ontology does not need 
to be generated by Mecureo. 
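
As a rough illustration of how a tool could consume such a file, the following sketch reads Concept elements like the one in Fig. 7 using Python's standard library. The rdf and dc namespace URIs are the standard ones; the default namespace http://example.org/mecureo# and the function name read_concepts are placeholders, since the real namespace of Mecureo's vocabulary is not given here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MEC = "http://example.org/mecureo#"  # hypothetical namespace for Concept/hasParent/hasChild

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns="http://example.org/mecureo#">
  <Concept rdf:ID="novice">
    <dc:Title>novice</dc:Title>
    <hasParent rdf:resource="#selection%20bias"/>
    <hasChild rdf:resource="#hunt%20and%20peck"/>
    <hasChild rdf:resource="#shortcuts"/>
  </Concept>
</rdf:RDF>"""


def read_concepts(owl_text):
    """Return {concept id: [(relationship, related term), ...]}."""
    root = ET.fromstring(owl_text)
    concepts = {}
    for concept in root.iter("{%s}Concept" % MEC):
        name = concept.get("{%s}ID" % RDF)
        links = []
        for child in concept:
            target = child.get("{%s}resource" % RDF)
            if target is not None:  # skips dc:Title, which has no resource
                relation = child.tag.split("}", 1)[-1]
                # Drop the leading '#' and decode %20 and similar escapes.
                links.append((relation, unquote(target.lstrip("#"))))
        concepts[name] = links
    return concepts


print(read_concepts(SAMPLE))
```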



4.3 Supporting Metadata Terms That Do Not Appear in the Ontology 

The process taken by Mecureo to generate a directed graph of the terms in the dic- 
tionary involves making each term a node. It then scans through each of the defini- 
tions for terms that are also nodes and generates a link between them. The graph is 
gradually built up as more definitions are parsed until there are no more to do. 

This means that there will be many words that appear in the definitions that will 
not be in the final graph because they are not a term in the dictionary. As an example, 
the word novice appears many times in the Usability Glossary (such as in the definition 
for hunt and peck) but is not a term because it has no glossary entry of its own. 

As we have discussed, it is common that such a core term as novice would be a 
suitable metadata term, so we would like to be able to enhance the core Mecureo 
ontology to include it. So we have enhanced Mecureo to allow a user to create their 
own local dictionary additions. 

The user-defined dictionary definitions are stored in a file format that can be read 
by Mecureo. An example of a definition in this file is given in Fig. 8. When the parser 
is run, the file containing the user defined terms is merged with the dictionary and 
parsed by Mecureo to create the ontology graph. These pseudo-terms need to be no 
more than just a declaration of the word as a term, and do not require definitions of 
their own since Mecureo will form links to and from each pseudo-term to existing terms 
through their definitions. 



Exploration 
[#exploration] 
Simulate the way users explore and learn about 
an {interactive system}. 
Related: {cognitive modeling} {learning curve} 
Categories: <Usability Methods> 



Fig. 8. An entry for the term exploration (declared on line 1). The second line is the URL iden- 
tifier for the term, followed by the definition and the related (existing) terms in the dictionary 
and which categories this term belongs to, respectively. 

Ideally, we would like to make the ontology enhancement process as lightweight 
as possible. The simplest approach would be to nominate a term, such as novice, to 
become treated as a new, additional term in the dictionary. Mecureo will then link this 
term within existing dictionary definitions to other parts of the ontology. It would be 
very good if this alone were sufficient. The first column in Table 1 shows the results 
of this very lightweight approach for the terms that we wanted to include as metadata 
for the collection of learning object slides in the lecture on cognitive walkthrough. So, 
for example, the first entry indicates that the concept novice users gives two links. 

The Term and Definition column of Table 1 shows the linkage to the new terms 
when we used the contents of the online lecture slide that demonstrates this concept 
as the definition. We do not force any related terms using the chain brackets. This 
means that links to other existing terms can only be inferred from the words appear- 
ing in the definition. 

The Term, Definition and Related column shows the linkage to the new terms 
when we use two keywords, in addition to the definition as just described. For example, 
the term exploration would appear in the user defined term list as it appears in 
Fig. 8. Mecureo automatically creates relationships to terms that are enclosed in 
braces. In Fig. 8 these are {cognitive modeling} and {learning curve}. Essentially, this 
allowed us to ‘force’ a relationship between our new term and one or more existing 
terms in the dictionary. The other relationships have come from parsing the definition 
we provided for the term, and the definitions in the terms that Mecureo has found to 
relate to this term. 
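
To make the entry format concrete, here is a small sketch of how an entry like the one in Fig. 8 could be read. It is an assumption-laden illustration rather than Mecureo's parser: the function name parse_entry and the returned field names are hypothetical, the Categories line is kept as opaque text, and the sample omits that line because its exact syntax is not fully clear from the figure.

```python
import re

BRACED = re.compile(r"\{([^}]*)\}")


def parse_entry(text):
    """Parse one user-defined dictionary entry into a dict.

    Line 1 is the term, line 2 its identifier in square brackets; the
    remaining lines are the definition, except for the Related: and
    Categories: lines.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    entry = {"term": lines[0], "id": lines[1].strip("[]"),
             "related": [], "categories": ""}
    body = []
    for line in lines[2:]:
        if line.startswith("Related:"):
            entry["related"] = BRACED.findall(line)  # forced relationships
        elif line.startswith("Categories:"):
            entry["categories"] = line[len("Categories:"):].strip()
        else:
            body.append(line)
    entry["definition"] = " ".join(body)
    # Terms marked with braces inside the definition body itself.
    entry["inline_terms"] = BRACED.findall(entry["definition"])
    return entry


SAMPLE_ENTRY = """Exploration
[#exploration]
Simulate the way users explore and learn about
an {interactive system}.
Related: {cognitive modeling} {learning curve}
"""
print(parse_entry(SAMPLE_ENTRY)["related"])
```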

One problem of just defining the Term name is that in the case of multi-word 
terms, Mecureo fails to find enough (if any) related terms, causing orphan nodes in 
the ontology graph. Bootstrapping the parsing process by giving the pseudo-term 
some existing related terms and a short definition minimizes this effect and gives 
more favorable results. For the longer user defined terms, the lower number of links 
occurs because of the text matching level in the parser (which controls things such as 
case-sensitivity and substring matching). There is an obvious bias towards shorter 
words. 



Table 1. Added Term Linkage. 

Term                 Term name only  Term and Definition  Term, Definition and Related
novice users         2               3                    5
Discretionary users  0               1                    3
casual users         0               1                    3
Exploration          9               10                   12
usability technique  0               1                    1
testing process      1               2                    4



5 Evaluations 

Through our own experiences and evaluations we have discovered that the un- 
augmented Usability Glossary has only a very small overlap with the terms used in 
the learning objects of the User Interface Design and Programming course; the course 
used less than 10% of the terms defined in the dictionary. This poor term coverage is 
attributed to two facts. 

Firstly, there are cases where we use slightly different terminology. For example, 
the term cognitive modeling in the glossary is used in a similar sense to the term 
predictive usability which is used in the course. 

The second problem is that there are some concepts that are considered important 
in the course and which are part of several definitions in the Usability First dictionary 
but are not included as actual dictionary entries. This was discussed in the previous 
section. 

We have run a series of evaluations of our approach. One that was intended to as- 
sess the usability of the Metasaur interface [28] indicated that users could use it effec- 
tively to search the SIV interface for terms that appeared in the text of online lecture 
slides. This also gave some insight into the ways that SIV might address the first 
problem, that of slightly different terminology. 

The participants were asked to annotate a set of lecture slides about an Introduc- 
tion to empirical usability. The participants had to read each of the slides in the lec- 
ture, and then select concepts that best represented the subject matter covered on 
them. The participants were a mix of domain experts and non-experts. From our ob- 
servations and questions answered by the participants, they were frustrated when 
words on the slides did not appear in the ontology. Domain experts navigated the 
ontology to find close terms to those words in contrast to non-experts who chose to 
continue onto the next slide. At this stage of the Metasaur development, no functionality 
was available for users to add their own terms. A summary of the results is 
shown in Table 2. 

Table 2. Summary of Experiment Logs. 

Subject  No. Terms  Total Time (min)  No. Clicks  No. Back  No. Search  No. Depth  Click Conversions
A        13         10.77             3           4         34          1          0
B        16         11.88             0           0         38          6          0
C        19         20.85             8           1         55          11         0
D        32         26.07             12          9         42          4          2
E        11         17.63             1           0         26          1          0
F        43         34.27             20          0         84          3          2
G        36         22.85             15          2         55          0          3

The results from the evaluation show the number of terms the user added, the time 
to annotate all the slides, the number of clicks (on terms in the visualisation, the back- 
button, the search button, and the depth buttons), and also the number of ‘click 
conversions’ - terms added having been discovered by navigating the ontology rather 
than through a direct search. 

There were a total of 8 slides to annotate (including the title and summary slides). 
In the case of the first user, they took nearly 11 minutes to annotate all the slides. A 
total of 13 metadata terms were added. 34 searches were made (by either typing in a 
term and clicking on the button or by highlighting text on the slide and using Search 
Selected), and back was pressed 4 times to backtrack through the terms they had 
selected in the visualisation. The user clicked on a total of 3 terms on the SIV display 
(not including search results) and the final column, Click Conversions, shows they 
added no terms from these clicks. Of the seven participants, B, D and E had previously 
used the interface. Participants B, D, F and G were familiar with the 
domain, and user D was in fact the author of the slides. 

The task our participants performed is akin to that of a librarian cataloguing a 
book. In both these cases, the participants or librarians are not reading the whole book 
(or listening to the audio on the slide). All participants made extensive use of the text 
searching facilities and added a reasonable number of terms. However, one identifi- 
able problem was that not all the participants were familiar with the domain, and were 
hesitant to add words when they did not understand the exact meaning (even if it was 
a word on the slide). 

The results show that only three users (D, F, and G) made use of the exploratory 
design of the visualisation to find new terms (the Click Conversions column in Table 2). 
They represent three of the four users who were familiar with the user interface 
design and programming domain. These three users also tended to add more 
metadata terms than the other participants. This suggests that the other users needed 
either more time to gain familiarity with the interface or a better understanding 
of the domain and task. 



6 Creating User Models 

The website for the course collected a form of augmented web log. We have analysed 
this data to form one source of user modelling evidence. Essentially, if a learner has 
'attended' an online lecture, we treat this as evidence supporting the conclusion that 
they know concepts taught in that lecture. We analyse the web logs to model how 
well each student has ‘attended’ the lecture. 

All of the metadata created by Metasaur has been used as components in a user 
model. We use the Personis server software to store the models [10]. Every access to 
a slide is treated as a piece of evidence that the user has gained some understanding 
of the concepts that it teaches. These pieces of evidence are added to the user models 
in Personis. 

A resolver has been written to go through the evidence and determine a discrete 
number to represent the user’s “understanding” of the concept. We have chosen to 
aggregate the hits to particular slides in a learning object by the length of time a user 
has spent on it. This is to reflect the fact that the majority of the concepts are taught 
through the audio; to gain a full understanding of the concepts taught the students 
should have listened to all of the audio for the slide. Each visit by a student is com- 
pared against the actual audio time, and a weighting determined based on the follow- 
ing categories: 

• Seen - stayed for less than 10% of the audio length, weight = 0.1 

• Partial Heard - stayed more than 10% but less than 80% of the audio length, 
weight = 0.5 

• Full Heard - stayed from 80% up to 150% of the audio length, weight = 1.0 

• Overheard - stayed for over 150% of the length of the audio, weight = 0.8 

We have penalised the Overheard visits slightly. This is to account for the times when 
students have become distracted with other activities and have left the browser open. 
Whether this penalty is justified or not will be investigated in future evaluations. The 
resolver then averages all of the evidence weights for a component over the number 
of slides that component appears on. This gives a value from 0 to 1.0; a perfect student 
will have listened to every slide as a Full Heard, resulting in a value of 1.0 for 
the component. The resolved values can then be exported as an XML file that can be 
parsed by SIV. This XML file is essentially a list of concepts and a value that can be 
overlaid on the ontology visualisation to show a user model. An example is shown in 
Fig. 9. The concepts that are left aligned are the ones that appear in the course. For 
example, error is a term that is used as metadata for the course. The user model 
shown only contains the concepts in the first three lectures. 
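
The weighting and averaging steps above can be summarised in a short sketch. It is not the UIDP-UM resolver itself: the function names are hypothetical, the treatment of the exact 10%, 80% and 150% boundaries is our assumption, and so is the choice to keep only the best-weighted visit when a student opens the same slide more than once (the text does not say how repeat visits are combined). Slides that were never visited contribute 0.

```python
def visit_weight(seconds_on_slide, audio_seconds):
    """Map one visit to an evidence weight using the four categories."""
    ratio = seconds_on_slide / audio_seconds
    if ratio < 0.10:
        return 0.1  # Seen
    if ratio < 0.80:
        return 0.5  # Partial Heard
    if ratio <= 1.50:
        return 1.0  # Full Heard
    return 0.8      # Overheard (slightly penalised)


def resolve_component(slide_audio, visits):
    """Average evidence weights over the slides a component appears on.

    slide_audio: {slide id: audio length in seconds} for those slides.
    visits: iterable of (slide id, seconds spent on the slide).
    """
    best = {slide: 0.0 for slide in slide_audio}
    for slide, seconds in visits:
        if slide in slide_audio:
            weight = visit_weight(seconds, slide_audio[slide])
            best[slide] = max(best[slide], weight)  # assumption: keep best visit
    return sum(best.values()) / len(best)


audio = {"slide-1": 100, "slide-2": 200}
print(resolve_component(audio, [("slide-1", 100), ("slide-2", 200)]))  # perfect student
print(resolve_component(audio, [("slide-1", 100), ("slide-2", 400)]))  # one Overheard visit
```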

7 Related Work 

There has been considerable interest in tools for ontology construction and metadata 
editing. This section briefly discusses some existing tools and contrasts them to our 
own research. 

Fig. 9. A visualisation of a user model in SIV. 

The first part of our work is the extraction of an ontological graph structure from 
existing sources. The OntoExtract [16] tool creates lightweight ontologies from a 
corpus of documents (in their case, insurance documents in the Swiss Life company). 
The links between terms in the ontology have weightings like the ones created by 
Mecureo. However, the weights assigned by OntoExtract have increasing values as 
the ‘relatedness’ increases, whereas Mecureo assigns lower weights to more closely 
related terms. This is fundamental to Mecureo being able to generate subgraphs with a distance cut-off. 
When a subgraph is created, Mecureo generates a spanning tree from the selected 
node, summing the link weights until a specified limit (the ‘maximum distance’) is 
reached. This means that Mecureo's ontologies enable us to reason across a series of 
links. 

In [29], Jannink describes the conversion of the Webster’s dictionary into a graph. 
Relationships have a strength value based on their appearance in the definition, similar 
to Mecureo. The major difference with our work is that the parsed Webster’s dictionary 
is quite comprehensive. The resultant graph contains lexical constructs such 
as conjunctions and prepositions as nodes in the graph. There are three types of relationships 
between the words in the graph determined by a heuristic that utilizes the 
strength value. In contrast, Mecureo determines the relationship type through some 
simple NLP and pattern recognition. Jannink's work therefore tackles a quite different style 
of ontology, with more generic concepts modelled, whereas we have purposely chosen to 
focus on specialised dictionaries since they are better suited to the markup of learning 
objects in a particular domain. It is not clear whether approaches that are suited to 
basic generic concepts would be particularly suited to our more specialised concept 
sets. 

AeroDAML [30] is a tool that automatically marks up documents in the 
DAML+OIL ontology language. The amount of automation can be varied to suit the 
level of user interaction. Technical users are more likely to use a semi-automated 
approach to annotating the metadata, whereas non-technical users might prefer an automatic 
approach. AeroDAML uses the WordNet upper-level noun hierarchy as the 
ontology, in contrast to Metasaur’s ontology built from any online dictionary or glossary 
source. 

The SemTag [31] application does semantic annotation of documents, designed for 
large corpora (for example, existing documents on the Web). SemTag stores the se- 
mantic annotations on a server separate from the original document as it does not 
have permission to add annotations to those files. In contrast, Metasaur has been 
designed to be used in an environment where the metadata authors do have access to 
write to the existing content. Importantly, the nature of the evaluation of their system 
is inherently different from our own. They have asked arbitrary users to check and 
approve large numbers of semantic links constructed as a means of evaluation. We 
have taken a more qualitative approach with the metadata checking being performed 
by the teacher who wants to be in complete control of the metadata associated with 
the learning objects they use in their own course. Perforce, this means that we have 
done a much less extensive evaluation but one that sets much more exacting stan- 
dards. 

Another very important element of the current work is that the text available for the
learning objects is just a small part of each learning object. The bulk of the content of
most slides is in the audio ‘lecture’ attached to the text of each slide. If we were to
aim for automated extraction of metadata, that would require analysis of this audio
stream, a task that is extremely challenging with current speech understanding
technology. But even beyond this, the accurate markup of the metadata is challenging
even for humans, as we found in our earlier evaluations [28], where less expert
users made poor choices of metadata compared with relatively more expert users,
who had recently completed this course satisfactorily. This latter group defined
metadata that was a much better match to that created by the lecturer. Indeed, this is one
reason that we believe an interesting avenue to pursue is to enhance the interaction
with the learning objects by asking students to create their own metadata after
listening to each of the learning objects. Checking this against the lecturer's defined
metadata should help identify whether the student appreciated the main issues in that
learning object.

The novel aspect of our work is the emphasis on the support for the user to
scrutinise parts of the ontology. Users can always refer to the original dictionary source
that the ontology was constructed from since all the relationships are inferred from 
the text. Because the dictionary source is online, it is easily available to the users, and 
changes to the ontology can be made either through the addition of new terms, or 
regenerating the ontology with a different set of parameters for Mecureo. 



8 Discussion and Conclusions 

We have performed a number of evaluations on SIV and Metasaur. Evaluations indi- 
cate that users were able to navigate the graph and use it as an aid to discover new 
concepts when adding metadata. The evaluation of Metasaur described in [28] was on 
an earlier version of the system that did not allow users to define their own terms. A 
larger evaluation is currently being planned that will incorporate the ability to add 
new concepts to the ontology. 

The Metasaur enhancements that enable users to add additional concepts, with 
their own definitions, are important for the teacher or author creating metadata for the 
learning objects they create and use. An interesting situation arises when users have 
different definitions for terms - they will be able to merge their definitions with the 
core glossary definitions at runtime, resulting in a different ontology for each user. 
Users could potentially share their own dictionary or use parts from other users’
dictionaries to create their own ontologies.

We believe that Metasaur is a valuable tool for helping users mark up data. For
teaching, it will not only be useful to instructors wishing to add metadata to learning
objects, but also to students who will be able to annotate their own versions of the
slides, providing potential to better model their knowledge for adaptation. The
user-defined dictionaries enrich the base ontology, resulting in better inferences about
the concepts. In our teaching context this means the metadata will give a higher quality
representation of the learning objects, allowing for better user models and adaptation
of the material for users.

In this chapter we have described a system for building user models for students
based on their access to multimedia learning objects. We have described the main
elements of the system: Mecureo for building the ontologies, SIV for visualising them,
and Metasaur, which uses SIV to support metadata annotations of learning objects.
The current version of Metasaur provides an easy way to navigate the domain
ontology and to create metadata. The enhancements made to allow users to add terms to
the ontology result in a higher quality representation of the concepts taught by the
course. The terms used as metadata are used as the basic components in the user
model, and inference is possible because of the underlying ontology structure. The
user model provides a way for the course to adapt its material to suit the user’s
current level of understanding; the visualisation allows users to scrutinise their model
and reflect upon their own learning.



Abstract. We present the conception and a partial implementation of
a personalized interaction environment for scientific information portals. 
This environment will support the user in finding documents via the por- 
tal and also in managing and working with these documents once they 
have been retrieved. To this end, it will combine existing techniques from 
document management systems, automatic recommenders, and browsing
assistants via an underlying user model that represents the current
interests of the user. The environment is currently in its early implemen- 
tation phase, with an early version of the document management tool 
completed. 



1 Introduction 

There is a vast amount of information available via today’s digital portals and 
libraries, and advances in user interfaces as well as search and browsing technolo- 
gies have simplified gaining access to this information. But many issues remain 
if we aim for information portals which are not only easy to use, but also effec- 
tive in providing the information and, going a step further, facilitating access to 
the knowledge we actually need. For example, think about your current research 
project and the issues you are trying to solve. How would you query your fa- 
vorite information portal for publications that might already include the answers 
to these issues? In particular, how would you formulate your query? And what 
about evaluating the query results? The query system of the portal has only the 
few terms you supplied in your query to rank its results, so taking a look at 
the ten highest ranked results will probably not satisfy your information need. 
Then, when you have found a number of potentially relevant articles, what do 
you do with them? Do you store them in some folder on your hard disk - to be 
forgotten, because before having time to take a closer look, you have to prepare 
a lecture about an unrelated subject? 

Interaction with information does not end with retrieving relevant documents 
[1], but also includes making sense of the retrieved information, organizing col- 
lected materials for near- or long-term use, and sharing insights with colleagues. 
With the work described here, our goal is to create an interaction environment
which facilitates these tasks by combining different techniques that have
already been developed for each separate task and leveraging them by providing
a common representation of the user’s interests and current working context. 
In particular, the system we envision consists of a tool for organizing personal 
document collections, an automatic recommender, an information filter, and a 
browsing assistant. Sharing data about the user among these different compo- 
nents will, for example, allow the recommender to increase the accuracy of its 
recommendations by taking the user’s current working context in the document 
manager into account. Thus, in addition to directly supporting the user by the 
functionality provided, the system can utilize the additional data about the user 
to provide contextualized and personalized support. 

The system described in this paper is currently in its early implementation 
phase, with only an early version of the document management tool completed. 

In the following, we first present a short survey of related work on which we 
build. In Section 3, we provide additional motivation for our work, which is then 
described in more detail in Section 4. In Section 5, we take a closer look at the 
conception of the user model. 

2 Related Work 

Browsing assistants, recommender systems, and information retrieval in general 
are well-researched subjects. This short survey of related work is therefore limited 
to a small subset of publications in these fields of research, pointing out some of 
the major systems and techniques we have built our work on. For a more general 
survey, see Baeza-Yates et al. [2] and Jones et al. [3]. 

Document management and sense-making were studied, for example, by Mal- 
one [4] and by Kidd [5]. Marshall et al. [6-8], Shipman et al. [9], and Furnas et 
al. [10] implemented and evaluated electronic document management systems. 
Their results and suggestions guided our conception of our own management 
system. 

The work of Perlin et al. [11] and Bederson et al. [12, 13] on zoomable user 
interfaces and the resulting tools provide the basis for the main interaction com- 
ponent of our document management system. 

Wolber et al. [14] present a tool which organizes personal document collec- 
tions on the basis of links between documents, thus allowing the user to browse 
the context of documents within the collection. They distinguish three kinds of 
relationships: “direct-link”, “directory”, and “content”. Our document manager 
allows the definition of semantically richer links such as provides-theoretical- 
background-for between documents, which should further support the user in 
capturing part of her knowledge in the structures created. 

Middleton et al. [15] and Pennock et al. [16] present two of many recom- 
mender systems which have been used to recommend papers from a collection of 
scientific literature. A survey of hybrid recommenders, which use a combination 
of different recommendation techniques, is provided by Burke [17]. We intend to 
utilize a hybrid approach in order to fulfill the different recommendation needs
of users of scientific information portals.

TalkMine [18] builds a “knowledge context” from the user’s information
retrieval history. A knowledge context consists of a keyword/record matrix, the
structure of the document set (as defined by citation links, among others), and 
proximity relations (e.g. co-citation). TalkMine uses knowledge contexts to char- 
acterize users as well as information sources. Recommendations are then based 
on the integration of information from the user (supplied query terms and knowl- 
edge context) and from the information sources. In addition, knowledge contexts 
of information sources are adapted on the basis of the contexts of their users. As 
was mentioned above, using our document manager, the user can define seman- 
tic relationships between documents explicitly. We intend to utilize the semantic 
networks defined by each individual user to construct a network which captures 
part of the knowledge of the entire user community. The latter can then be used 
by, for example, the recommender system to provide more relevant recommen- 
dations. 

Hanani et al. [19] provide a survey of information filtering systems. News 
Dude [20] is a news-story filtering system that explains its decisions to the user. 
It utilizes a long-term user profile to represent the general interests of the user 
and a short-term user profile to quickly adapt to the user’s information need 
when the user has read a story. The user model we propose also consists of 
different temporal layers, so that it can represent long-term interests as well as 
information needs relevant only for the current interaction. 

Letizia and PowerScout [21] are two browsing assistants which recommend
links on the basis of the user’s browsing behavior. These assistants analyze
the content of the visited pages and, in the case of Letizia, evaluate pages reach- 
able via hyperlinks from the current page, or, in the case of PowerScout, query 
a search engine for related pages. The browsing assistant in the system we are 
developing will propose paths through the information portal as well as rec- 
ommending information items based on a user model adapted according to the 
user’s behavior on the pages of the portal. 

The Outride system [22] attempts to individualize and contextualize search 
results of web-based search engines by query augmentation and result processing. 
It builds up a user profile on the basis of the user’s information retrieval history 
and browsing behavior, among other data. Similarly, our system will utilize the 
user model to augment the user’s queries and evaluate resulting documents. 

3 A Vision of Future Interaction with Electronic Scientific Information

Imagine the following situation: You are working on a project which aims to 
combine methods for automatically learning regular expressions with spatial 
reasoning to create an improved web-mining tool. You have already collected
some relevant literature about the different subjects related to your work. And
now, you want to check whether there are other people working on similar ideas.
To this end, you query Google (www.google.com) for "spatial reasoning" "web mining"
and receive a large list of pages containing the supplied terms. But browsing
through the first page of results, you do not find any document in which web
mining and spatial reasoning are mentioned in the same context. After some
additional unsuccessful attempts, you try "web crawling" layout, because
these are two other terms often used in the papers you have collected, and indeed,
this time you discover relevant publications.

If the search engine has access to your current working context, it can assist 
you in this task by trying to identify characteristic terms describing your current 
work and attempting to rewrite your query in different ways. 

Now, let us assume you start browsing from the list of query results. Even- 
tually, you discover a page containing the proceedings of a conference whose 
general theme seems to be relevant to your research project. For most of the 
included papers, the information provided in the table of contents is not suffi- 
cient to assess their relevance to your current work. So, you start to take a closer 
look at all of the papers that you cannot dismiss only by their titles, reading 
abstracts and conclusions to reduce the set of relevant candidates. These remain- 
ing documents seem to be very promising and you add them to your personal 
collection. 

If a browsing assistant picks up where the support of the search engine ends, 
it can use the information you have supplied so far to evaluate the documents in 
the proceedings for you. It could mark the documents which are most probably 
of interest and give a short explanation of this assessment so as to enable you to 
decide whether to accept the recommendation or not. 

After downloading the papers, you decide to take a look at the proceedings 
of earlier years of the same conference. And indeed, there seem to be further 
relevant publications. Not having the time to check the proceedings for all pre- 
vious conferences, you decide that it has to be sufficient to take a closer look at 
last year’s conference. But after manually evaluating the documents, you notice 
that the only paper which seemed to be of interest is actually quite similar to a 
paper you have already read and does not provide any more insights. 

If a recommender system takes your working context, query, and browsing 
behavior into account, it can recommend articles from the entire conference 
history which are of value to you. In particular, it can filter out documents that 
are too similar to those that you have already read. 

Continuing browsing, you find a number of other relevant information sources, 
among them pages of journals which are regularly updated. Since you want 
to stay up to date on any new developments throughout the duration of your 
project, you bookmark all of the relevant pages and decide to check them once 
in a while for new information. 

If an information filter takes your working context and interaction with the 
other system components into account, it can provide an effective (and accurate) 
means of staying up to date on the topics you are interested in.

Providing this enhanced support for retrieving information is one of our main 
goals in developing the system described in the following sections.

4 System Conception

With the interaction environment that we envision for scientific information por- 
tals, we aim to support the individual user not only in retrieving information 
from the portal, but also with organizing, managing, and using the retrieved 
information. The system will integrate into a common environment several tech- 
niques, each of which covers a different part of the user’s interaction with infor- 
mation: 

— A document management tool which resides on the user’s desktop and sup- 
ports the user in creating and utilizing a personal information collection. 

— A recommender which proposes information items related to the user’s cur- 
rent recommendation need. 

— An information filter which assists the user in staying up-to-date on specific 
topics. 

— A browsing assistant which supports the user in browsing the information 
portal and exploring query results. 

An underlying user model provides the glue between these components. 



4.1 Personal Document Management Tool 

The personal document management tool aims at supporting the user in manag- 
ing and making sense of a personal document collection. To specify the require- 
ments for this tool, one has to understand how people work with their documents. 
This is a well-researched issue, and from the studies published in the literature 
(e.g. [4,5,8]) and interviews with students and researchers at DFKI, we derived 
the following main requirements for the document manager: 

— People use spatial organization to express relationships between (physical) 
documents, so the document manager should provide means of spatially 
organizing document representations. 

— When collecting documents, users often cannot immediately decide where to 
file or how to classify these documents. The tool should facilitate the easy 
creation of informal structures out of newly discovered documents. 

— People tend to organize their documents into hierarchical structures. The 
tool should support this activity. 

— People annotate documents to make it easier for them to remember what 
was important about the information at a later time. Hence, the tool should 
allow users to associate an annotation with each document. 

— Annotations are also used to express more specific relationships between 
documents (e.g., “Document A describes a user evaluation of the system 
presented in document B”). The document manager should explicitly sup- 
port the definition and browsing of these kinds of relationships. 

— The tool should support the user in finding documents in the personal
collection.

The first three requirements led us to the adoption of a two-dimensional,
zoomable plane as the central component of the document management tool
(see figure 1). As in systems like VIKI [7] and NaviQue [10], the user will
be able to place icons representing documents freely on this plane, and create
named groups containing documents or other groups.




Fig. 1. The document management tool allows users to place objects freely on a 2- 
dimensional plane, associate them with notes, and mark their importance (with color 
coding). To simplify browsing and navigating within large document collections, the 
user can zoom the plane in and out 



The user will be able to associate each document within the personal col- 
lection with a text file. This file is intended to store in-depth notes about the 
document in question. In addition, a user will have the possibility of placing notes 
on the plane, which are intended for short annotations of spatial structures, as 
opposed to single documents. 

The user will be able to define detailed semantic relationships between doc- 
uments via a simple drag-and-drop interaction. For example, if the user wants 
to express that document A describes an improved version of an algorithm orig- 
inally described in document B, she drags and drops the representation of A 
on top of B, which will not move A on the plane but initiate a dialog in which 
the user can select the relationship she wants to define. This way, the user can 
build up her personal knowledge base, capturing part of her knowledge about 
the semantic relationships between documents. 
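
To make the idea of a personal knowledge base of typed links more concrete, the following is a minimal sketch and not the implementation of the document manager. The class name KnowledgeBase, its methods, and the relation names "improves-on" and "evaluates" are assumptions introduced purely for illustration.

```python
from collections import defaultdict


class KnowledgeBase:
    """Typed, directed links between documents (hypothetical structure)."""

    def __init__(self):
        # relation name -> set of (source, target) document pairs
        self._links = defaultdict(set)

    def relate(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self._links[relation].add((source, target))

    def sources(self, relation, target):
        """Return, sorted, all documents that stand in `relation` to `target`."""
        return sorted(s for s, t in self._links[relation] if t == target)


kb = KnowledgeBase()
kb.relate("A", "improves-on", "B")  # the drag-and-drop example from the text
kb.relate("D", "evaluates", "B")    # assumed second relation type
improving_on_b = kb.sources("improves-on", "B")
print(improving_on_b)
```

Storing links per relation type keeps a query such as "which documents improve on B?" to a single lookup, which is the kind of browsing and querying the text describes.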

The document manager will provide means for browsing and querying the
constructed knowledge base; thus the knowledge base provides immediate value
to the user. But it is also valuable to the system as a source of information about
the user’s knowledge and current working context.



Fig. 2. Section of a semantic network representing the relationships between objects
in the document management tool

4.2 Recommender and Information Filter 

The user will at different times have different recommendation needs. In the 
context of a portal to scientific information, these will include: 

— Finding current publications related to the topics that the user is working 
on. 

— Finding publications similar to those the user currently works with. 

— Finding a certain type of publication (e.g., an empirical study) related to a 
given document or working context. 

— Finding documents pointing to new ideas and solutions. 

To fulfill these recommendation needs, the system requires different
recommendation strategies. To this end, we plan to implement knowledge-based,
content-based, and collaborative-filtering recommendation components, and 
combine these to create a hybrid recommender [17] in a way that is suitable 
in light of the user’s specific recommendation need. For example, if the user is 
interested in documents similar to those she is currently working with, using 
a combination of a knowledge-based and a content-based recommender is more 
sensible than using a collaborative recommender, since the collaborative recom- 
mender is more likely to recommend documents only marginally related to what 
the user is thinking about. But this same characteristic is valuable when the user 
is interested in new ideas. 

Recommending documents is one domain in which the semantic networks 
constructed by the individual users of the information system can be utilized: 
By collecting and combining the (sub)networks supplied by the users, the
recommender can learn a semantic structure on top of the set of relationships between
documents already provided by the information portal (reference relations are an
example of the latter). We expect this additional layer to be very valuable for
improving the quality of recommendations and of the interaction between the user
and the recommender. For example, the recommender should be able to provide
more intelligible explanations for why a particular information item has been
recommended.




[Figure 3 (image not reproduced). Labels: an “improves on” link in the semantic
network; the user's query “Are there documents improving on C?”; the
Recommender's answer “A might be what you are looking for!”]

Fig. 3. The recommender utilizes fragments of semantic networks constructed by its
users to fulfill more specific recommendation needs

The information filter will use the same approach as the recommender, but it 
will continuously be evaluating incoming documents in terms of their relevance 
to the user. 



4.3 Browsing Assistant 

The browsing assistant will accompany the user while she is browsing the in- 
formation portal and recommend actions (e.g., following a hyperlink or reading 
a document). The recommendations of the assistant are limited to the options 
available to the user on the current web page. 

The support of the assistant will pick up where the functionality of the por- 
tal’s query system ends: After receiving the list of results, the user has to evaluate 
their relevance to the information need. Doing this, the user learns more about 
the information available via the portal, which in turn leads to a change in the 
information need. The assistant attempts to adapt the system-internal repre- 
sentation of the user accordingly on the basis of the user’s interaction with the 
portal.

Since the assistant has to rely on uncertain evidence (e.g., the time the user
stays on a specific page) to adapt the user profile, it is important that inaccurate 
recommendations should not interfere with the user’s task. Accordingly, the 
assistant will not modify the structure of the page the user is looking at but only 
highlight its recommendations. In addition, the assistant will always display a 
description of the current representation of the information need, so the user can 
quickly decide whether the recommendations are likely to be accurate or not. 
Finally, the user will be able to deactivate the assistant quickly. 

5 User Model 

The system will maintain an internal representation of the user’s interests in or- 
der to be able to fulfill the user’s information needs effectively. This user model 
is the central means of interaction between the various proposed components: 
The document management tool will adapt the model on the basis of the user’s 
interaction with the document collection and hand it to the recommender when- 
ever the user formulates a recommendation need. Then, when the user starts 
browsing the information portal, the browsing assistant fetches the profile from 
the recommender and adapts it according to the user’s navigational behavior. 

5.1 Requirements 

The user model will be used by the recommender, the information filter, and 
the browsing assistant. Modifications will be performed by the aforementioned 
components as well as by the document management tool and the user. Hence, 
the model has to fulfill several requirements: 

— It has to make sense to the user, so it can be inspected and, if necessary, 
corrected. 

— The user will maintain a set of working contexts. The user model has to take 
the active context into account. 

— It has to be comparable to other user models in order to be usable by a 
collaborative filtering recommender. 

— It has to be mappable into the ontology of the information portal, so it can 
be utilized by a knowledge-based recommender. 

— It has to be mappable into document space (see footnote 2), so that it can be utilized by a
content-based recommender. 

— It has to be as accurate as possible, in order for the recommendation com- 
ponents to provide accurate recommendations. 

— It has to be quickly adaptable in order for the browsing assistant to be of 
value to the user. 

Some of these requirements contradict one another: A user model that adapts
quickly on the basis of uncertain evidence will not be very accurate, and a model
that can be mapped into document space is probably not very intelligible to the
user. Accordingly, we propose a layered user model, where each layer provides a
subset of the required characteristics.

2 Most text-document retrieval systems utilize the vector space model, in which
documents are represented by weighted term vectors.

5.2 Conception 

On an abstract level, the proposed user model consists of three layers (see fig- 
ure 4): 

— A general, fairly accurate, slowly adapting long-term profile. This layer rep- 
resents the user’s general interests in terms of research areas. For example, 
the long-term profile might state that the user is quite knowledgeable in “Ap- 
plied Machine Learning” and “User Modeling” but less interested in their 
theoretical foundations. 

— A more specific medium-term profile which represents the user’s current 
working context. It consists of the most specific research areas covering the 
documents in the context, characteristic attributes extracted from the asso- 
ciated bibliographic information (such as authors and conferences), as well 
as a set of terms which are representative of the documents in the
context. For example, a context of a user working in uncertain reasoning might
contain references to the research areas “Bayesian Networks” and “Markov
Processes”, the “Uncertainty in AI” conferences, and terms like “reasoning”,
“temporal models”, “memory”, and “complexity”.

— A very specific, quickly adapting, but possibly less accurate short-term pro- 
file. It represents the information need during the user’s current interaction 
with the system, and contains the same kind of information as the medium- 
term profile, but puts a stronger emphasis on document terms and keywords 
than on more general ontological concepts. The main difference from the
medium-term model is that the short-term model is adapted after each
interaction with the portal, which means it has to cope with sparse and 
uncertain evidence as well as with hard time constraints for analyzing it. 

As an example of the utilization of this user model, take a possible strat- 
egy for recommending related material: To fulfill this recommendation need, 
the recommender could use a cascade of knowledge-based, collaborative, and 
content-based modules. First, the relevant concepts of the ontology provided by 
the user model are used to focus on the associated set of documents. Next, doc- 
uments used by users with similar profiles are selected from this set. Finally, the 
set of relevant terms stored in the profile is used to identify those documents in 
the reduced set which are most closely related to the user’s work. 
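
The cascade described above can be sketched as follows. This is only an illustration under assumed data structures: the function name cascade_recommend, the layout of the docs dictionary, and the simple dot-product term score are hypothetical choices and are not taken from the system itself.

```python
def term_score(doc_terms, profile_terms):
    """Dot product of two sparse term-weight dicts (assumed scoring function)."""
    return sum(w * profile_terms.get(t, 0.0) for t, w in doc_terms.items())


def cascade_recommend(docs, profile_concepts, similar_user_docs, profile_terms, top_n=3):
    """Three-stage cascade; `docs` maps id -> {"concepts": set, "terms": dict}."""
    # 1. Knowledge-based: keep documents filed under a concept in the profile.
    candidates = [d for d, info in docs.items() if info["concepts"] & profile_concepts]
    # 2. Collaborative: keep documents also used by users with similar profiles.
    candidates = [d for d in candidates if d in similar_user_docs]
    # 3. Content-based: rank what remains by closeness to the profile terms.
    candidates.sort(key=lambda d: term_score(docs[d]["terms"], profile_terms), reverse=True)
    return candidates[:top_n]


docs = {
    "d1": {"concepts": {"Bayesian Networks"}, "terms": {"reasoning": 1.0, "complexity": 1.0}},
    "d2": {"concepts": {"Bayesian Networks"}, "terms": {"memory": 1.0}},
    "d3": {"concepts": {"Databases"}, "terms": {"reasoning": 1.0}},
    "d4": {"concepts": {"Markov Processes"}, "terms": {"reasoning": 2.0}},
}
result = cascade_recommend(
    docs,
    profile_concepts={"Bayesian Networks", "Markov Processes"},
    similar_user_docs={"d1", "d3", "d4"},
    profile_terms={"reasoning": 1.0},
)
print(result)
```

In this toy data, d3 is dropped by the knowledge-based stage, d2 by the collaborative stage, and the content-based stage orders the two remaining documents.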

Having a model that consists of three layers raises the issue of how to combine 
these layers. The long-term model is fairly accurate but too general to be of much 
use on its own. On the other hand, the short-term model is the most relevant 
one for the current interaction, but it can be quite inaccurate. One possible 
technique for combining the profiles is to weight each concept and term in each
profile according to the confidence that the system has in its appropriateness and
then to use a weighted sum over the layers to create final weights for the profile’s
contents. Thus, when the user has just started browsing through the portal, the
confidence in the short-term profile will be low and the recommender will rely on
the medium- and long-term profiles. But after the system has collected additional
evidence, its confidence in the short-term profile will increase and that profile will
gain a greater influence on the overall behavior of the system.

Fig. 4. We propose a three-layer user model, in which layers are differentiated by their
generality, speed of adaptation to new evidence, and likely accuracy. In connection
with an information retrieval task, we expect a general profile to provide high recall
but low precision and a more specific profile to provide low recall but high precision
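
One way to read the confidence-weighted combination is sketched below. The function name combine_layers, the use of one confidence value per layer on a 0 to 1 scale, and the normalization by the total confidence are all assumptions made for this illustration; the text leaves these details open.

```python
def combine_layers(layers, confidence):
    """Confidence-weighted sum of per-layer weights.

    layers:     layer name -> {concept or term: weight}
    confidence: layer name -> confidence in that layer (assumed to lie in [0, 1])
    """
    total = sum(confidence[name] for name in layers)
    combined = {}
    for name, weights in layers.items():
        # Each layer contributes in proportion to the system's confidence in it.
        share = confidence[name] / total if total else 0.0
        for item, weight in weights.items():
            combined[item] = combined.get(item, 0.0) + share * weight
    return combined


layers = {
    "long": {"machine learning": 1.0},
    "medium": {"bayesian networks": 1.0},
    "short": {"web mining": 1.0},
}
# Start of a session: little evidence, so low confidence in the short-term layer.
early = combine_layers(layers, {"long": 0.6, "medium": 0.3, "short": 0.1})
# Later in the session: more evidence has raised that confidence.
later = combine_layers(layers, {"long": 0.3, "medium": 0.3, "short": 0.4})
print(early["web mining"], later["web mining"])
```

As the confidence in the short-term layer grows, its terms receive a larger share of the final weight, which mirrors the behavior described in the text.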

The three-layer model constitutes an abstract view of the user model. The 
basis for the concrete implementation of this model is the personal knowledge 
base that the user constructs with the help of the document manager. On the 
basis of the concepts in the knowledge base, its structure, and the user’s in- 
teraction with it, the user’s knowledge and interests in certain domains can be 
automatically assessed. 

For example, the system could identify topic areas of general interest to the 
user by identifying cohesive subgroups of information items and then selecting 
the smallest set of topic areas that covers these subgroups. The system could 
identify working contexts, which constitute the medium-term profile, by starting 
from the information items most recently used, following relationships to other 
concepts up to a certain distance. A currently selected item and its properties 
could be used to initiate the short-term profile. 
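
One way to read "following relationships to other concepts up to a certain distance" is as a bounded breadth-first traversal of the knowledge base. The sketch below assumes the knowledge base is available as an adjacency mapping; the graph, the item names, and the function name are hypothetical.

```python
from collections import deque


def working_context(graph, recent_items, max_distance):
    """Return {item: distance} for every item reachable from the most
    recently used items within `max_distance` relationship steps."""
    distances = {item: 0 for item in recent_items}
    queue = deque(recent_items)
    while queue:
        node = queue.popleft()
        if distances[node] == max_distance:
            continue  # do not expand beyond the distance limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical fragment of a personal knowledge base.
graph = {
    "paper_A": ["clustering"],
    "clustering": ["paper_A", "machine learning"],
    "machine learning": ["clustering", "statistics"],
    "statistics": ["machine learning"],
}
context = working_context(graph, ["paper_A"], max_distance=2)
```

Here "statistics" lies three steps away from the recently used item and is therefore left out of the working context.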

Utilizing the user’s knowledge base to build the user model simplifies the 
communication of the user model to the user, since all objects and their re- 
lationships that are used to create the model are represented in the document 
manager and are therefore familiar to the user. This makes possible explanations 
of the form “Your interest in machine learning is assessed as high because you 
collected a lot of documents related to this subject and rated most of them as 
interesting.” 

But assigning concepts from within the knowledge base to the three layers 
of the user model is only the first step. Recall from the requirements listed 
above that we want to compare different user models or map parts of them into 
document space. Thus, the system has to process the collected data further by, 
for example, computing representative term vectors for subsets of documents in 
the model. 
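
A representative term vector for a subset of documents could, for instance, be a centroid of per-document term counts. The following is a minimal sketch under that assumption; whitespace tokenisation and raw term frequencies are simplifications that the text does not prescribe.

```python
from collections import Counter


def term_vector(text):
    """Very simple bag-of-words vector (lower-cased, whitespace tokens)."""
    return Counter(text.lower().split())


def centroid(documents):
    """Average the term vectors of a subset of documents."""
    if not documents:
        return {}
    total = Counter()
    for document in documents:
        total.update(term_vector(document))
    return {term: count / len(documents) for term, count in total.items()}


representative = centroid(["user model", "user profile"])
```

Vectors of this kind can be compared across user models or matched against document vectors, which is the further processing the paragraph above calls for.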

An issue we have ignored so far in this discussion is the interaction with 
the information portal. This interaction is important if the system is to be able 
to capture the short-term interests of the user. Assuming that the information 
portal utilizes its own general knowledge base, the user model has to be mapped 
into this knowledge base, that is, concepts corresponding to the user’s current 
interests have to be identified. As the user browses through the information 
portal and learns more about the subject of interest, the user’s information need 
changes. To capture this change and update the short-term model accordingly, 
we intend to utilize path analysis algorithms as proposed, for example, by Chi 
et al. [23]. 

Although we have decided on using the general approach described above for 
constructing and maintaining the user model, the final decision on which specific 
algorithms to use in the different steps has not yet been made; it will depend in 
part on the results of user studies. 

6 Contributions and Issues for Further Work 

We have presented an overview of the conception of an interaction environment 
for scientific information portals that is currently in its early implementation 
phase. The following issues are among those that we are currently working on: 

— Information visualization: How can a personal document collection best be 
visualized so that it is easy for the user to build up and maintain semantic 
structures and to search and browse the collection? 

— User modeling: Is the suggested user model appropriate for the environment 
that we envision? How should uncertainty be managed in the model? What 
types of concrete evidence should be used to update the profiles, especially 
in the case of the short-term profile? 

— Adaptation: How can we make the adaptation processes transparent and 
controllable, so that the personalized environment will be accepted and used? 

Abstract. In this paper we present NewsFlash, an adaptive search system that 
helps a searcher to efficiently search a library of stored TV news reports. The 
system automatically summarises the closed-caption subtitles embedded in the 
TV broadcasts and selects appropriate sentences to best describe report content 
with respect to the searcher’s query. During interaction the system selects useful 
terms from these summaries and uses these terms to update the display and po- 
tentially update a stored searcher profile. We evaluate the worth of our ap- 
proach with real searchers and realistic information seeking scenarios. A novel 
means of testing the worth of a permanent profile of searchers’ general interests 
is also proposed. Our results show that the adaptive techniques we propose can 
work well in multimedia search environments. 



1 Introduction and Motivation 

We are all active consumers of information. However, our thirst for knowledge can 
soon turn into a glut if we are not careful about the amount of information we en- 
deavour to consume. Information overload [8] is a well-recognised, common problem 
and to tackle it we must be selective in how we choose to allocate our cognitive re- 
sources and manage our time. Television news is perceived as a means of reducing 
this overload and bringing only the important stories to the attention of viewers. The 
advent of 24-hour TV news channels has dramatically increased the available infor- 
mation and in doing so has placed increasing demands on viewers to filter out irrele- 
vant stories and make optimal use of their time. When attempting to locate stories of 
interest, viewers must often examine many that may not match their current informa- 
tion need or general interests. 

Multimedia Information Retrieval (MMIR) systems are a way of helping searchers 
locate information of interest from large multimedia corpora. The use of many such 
systems requires searchers to explicitly devise queries that represent their needs in a 
language understood by the search system. 

However, queries are only an approximate, or ‘compromised’ information need 
[13], and may fall short of the description necessary to infer relevant media objects 
(e.g. images, videos). The problems searchers have with expressing their information 
needs have been acknowledged [6]. These difficulties are compounded by the 
difficulty of providing increasingly better ranked results based solely on the initial query. 
Consequently, search systems need to offer robust, reliable methods for query modi- 
fication. 

Relevance Feedback (RF) has become a popular method for automatically improv- 
ing a system’s representation of a searcher’s information need. Searchers view a 
number of media objects and provide explicit feedback on which objects are relevant. 
The RF system is then able to formulate a query more attuned to the actual informa- 
tion need of the searcher and provide results that pertain more closely to this need 
(i.e. similar to those marked as relevant). Adaptive search systems unobtrusively 
monitor search behaviour and remove the need for the searcher to explicitly indicate 
which objects are relevant or formulate revised queries. They develop and enhance 
their knowledge of searcher needs incrementally from inferences made about their 
interaction and use this knowledge to help searchers in their seeking. 

In this paper we present NewsFlash, a prototype adaptive search system for online 
TV news that automatically offers searchers alternatives and extensions to their initial 
query based on their interaction. The system allows searchers to access large quanti- 
ties of news in an efficient way, using story detection to combine news reports from 
different broadcasts, hence reducing redundancy. The system monitors the ephemeral 
interaction of the searcher, and based on their actions, offers them support in their 
seeking. The fundamental assumption behind the adaptive approach we adopt is that a 
searcher’s actions are driven by their information need and that actions reflect needs 
as needs influence actions. 

The system operates on two temporal levels, combining a permanent, modifiable 
profile, unique to the searcher, with adaptive interface technologies and the selection 
of potential query expansion terms based on recent interaction. In response to 
searcher interaction, the system can reorder a list of news stories and recommend 
terms to be added to a searcher’s permanent profile. 

Systems that personalise the interaction with video catalogues are becoming increasingly 
popular [4][5]. Such systems use stored searcher profiles and require the 
searcher to enter and reformulate queries explicitly. This relies on the searcher being 
able to adequately express their information needs, something that they may not al- 
ways be able to do. The use of relevance feedback for video retrieval is also popular, 
and attempts to tackle this problem, yet still relies on the explicit relevance assess- 
ments of searchers to automatically reformulate the initial query using, for example, 
content-based techniques (e.g. colour features, histograms) [2][9]. This approach is 
effective but onerous, and as a result searchers may be unwilling to provide such 
feedback. 

Fischlar-News [12], an extension to the Fischlar [7] video search system, facili- 
tates access to large volumes of daily evening TV news. Fischlar-News has been 
operational for two years and offers searchers access to a large database of stored 
news, indexed with the closed-caption subtitles of the programs. The system requires 
the searcher to enter a textual query and returns a list of potentially interesting stories. 
The system does not offer any assistance in refining or enhancing searchers’ initial 
queries and no means of storing the general interests of particular searchers. The aim 
of this paper is not to present an alternative to well-developed technologies such as 
Fischlar-News, but to focus on a particular aspect of the overall search experience 
that may prove helpful to searchers and that systems such as Fischlar-News currently 
lack, namely adaptive querying. 

To evaluate the worth of our approach we solicit real searchers, and present them 
with realistic information seeking scenarios. We compare searcher effectiveness be- 
tween a baseline and systems that implement varying degrees of adaptivity. 

In the remainder of this paper we describe the NewsFlash system and its adaptive 
methods in Section 2 and the evaluation methodology in Section 3. The results of our 
evaluation are presented in Section 4 and discussed in Section 5, and we conclude in 
Section 6. 



2 NewsFlash 

The NewsFlash adaptive search system was developed to meet a perceived need for 
TV viewers to be able to search through hours of video in the minimum of time, al- 
lowing them to view the stories they want without having to view the news pro- 
grammes in their entirety. 

The system is divided into two main components: Indexing (responsible for the 
offline processing of the video) and Search (responsible for the online searcher 
interaction). In this paper we focus on the Search component and in particular the 
adaptive querying aspects of the search interface. However, it may be helpful for context 
and later reference to briefly describe the subcomponents that combine to form the 
Indexing part of the system. 

We split the processing and retrieval of the videos for performance reasons alone. 
The overheads involved in processing the images were too great to practically process 
video ‘on the fly’ at query time. 



2.1 Indexing 

The Indexing component handles all aspects of video segmentation. There are five 
parts of this component (Fig. 1). The parts operate in the order illustrated in the fig- 
ure. 




Fig. 1. Indexing components 



The shot detection mechanism uses the comparison of colour histograms to detect 
shots in a video. It takes two frames from the video and compares the overall level of 
each colour within each frame. If the difference exceeds a certain threshold, then it is 
inferred that the two frames originated in different shots. In NewsFlash, we initially 
grab 100 keyframes, evenly distributed throughout the video. The shot detection then 
passes through this set, calculating the exact location of the shot transitions. The 
method used is similar to that in [3]. 
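
The histogram comparison behind the shot detection can be sketched as follows. This is an illustration only: frames are modelled as flat lists of quantised colour indices, and the distance measure (half the L1 distance between normalised histograms) and the threshold value are assumptions, not the measure used in [3].

```python
def colour_histogram(frame, bins):
    """Normalised histogram of a non-empty frame given as colour indices."""
    histogram = [0] * bins
    for pixel in frame:
        histogram[pixel] += 1
    return [count / len(frame) for count in histogram]


def histogram_difference(h1, h2):
    """Half the L1 distance: 0 for identical histograms, 1 for disjoint ones."""
    return sum(abs(a - b) for a, b in zip(h1, h2)) / 2


def detect_boundaries(frames, bins, threshold):
    """Indices i where frame i is inferred to start a new shot."""
    histograms = [colour_histogram(frame, bins) for frame in frames]
    return [
        i
        for i in range(1, len(histograms))
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold
    ]


# Four tiny hypothetical frames: the colour content changes at index 2.
frames = [[0, 0, 1, 1], [0, 0, 1, 0], [2, 2, 3, 3], [2, 3, 3, 3]]
boundaries = detect_boundaries(frames, bins=4, threshold=0.5)
```

Small variations within a shot (frames 0 to 1 and 2 to 3) stay below the threshold, while the change of overall colour levels between frames 1 and 2 exceeds it.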

The video then needs to be categorised, an essential requirement of the search 
process. NewsFlash carries out text searches on the soundtrack of the video (encoded 
in the closed-caption subtitles broadcast with the news). These subtitles take the form 
of a text file with metadata and the text of the soundtrack and associated timestamp. 
The subtitle alignment is carried out after the shot detection, since the boundaries 
between shots will be known and timestamps from subtitles and boundaries can be 
aligned. Captions that are deemed to be within the shot are added to the text of that 
shot. 
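
The alignment step amounts to assigning each timestamped caption to the shot whose time interval contains it. A minimal sketch, assuming shot boundaries are available as a sorted list of start times in seconds and captions as (timestamp, text) pairs; both representations are hypothetical.

```python
from bisect import bisect_right


def align_captions(shot_starts, captions):
    """Attach each caption to the shot that is playing at its timestamp.

    shot_starts: sorted start times of the detected shots.
    captions:    (timestamp, text) pairs in broadcast order.
    Returns one text string per shot.
    """
    shot_text = [[] for _ in shot_starts]
    for timestamp, text in captions:
        index = bisect_right(shot_starts, timestamp) - 1
        if index >= 0:  # ignore captions that precede the first shot
            shot_text[index].append(text)
    return [" ".join(parts) for parts in shot_text]


shot_starts = [0.0, 10.0, 25.0]
captions = [
    (1.5, "Good evening."),
    (12.0, "Troops arrived"),
    (18.0, "at the border."),
    (30.0, "In other news"),
]
texts = align_captions(shot_starts, captions)
```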

The shot detection often results in a series of fractured clips that individually do 
not form a coherent story. NewsFlash merges shots to present one video per story. 
The videos are clustered by analysing the textual content of the shots and if they are 
sufficiently similar they are deemed to be part of the same story and the shots and 
their text are merged together. 
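
The merging of shots into stories can be sketched as a single pass that joins a shot to the current story when their texts are sufficiently similar. Cosine similarity over raw term counts, the restriction to adjacent shots, and the threshold are all assumptions of this sketch; the text only states that similarity of textual content is used.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def merge_shots(shot_texts, threshold):
    """Merge consecutive shots whose text is similar to the story so far."""
    stories = []  # each entry: (list of shot texts, accumulated term counts)
    for text in shot_texts:
        vector = Counter(text.lower().split())
        if stories and cosine(stories[-1][1], vector) >= threshold:
            stories[-1][0].append(text)
            stories[-1][1].update(vector)
        else:
            stories.append(([text], vector))
    return [" ".join(texts) for texts, _ in stories]


shots = [
    "troops cross the border",
    "the border troops advance",
    "football results tonight",
]
stories = merge_shots(shots, threshold=0.3)
```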

To allow the searcher to view search results in the web-based tool, a medium of 
presentation was developed where the searcher is shown a thumbnail of the video. 
This means that some still images of the video report had to be stored to allow the 
searcher to quickly preview the story. The system stores the first frame and the 
middle frame of the story, to allow searchers to preview what happens as the story 
unfolds. The middle frame is shown in place of the first frame on thumbnail mouseover. 
Thumbnail generation is not query dependent, so the same thumbnail is shown re- 
gardless of the query entered. 

An inverted index is constructed of the textual content of the video to allow fast 
and efficient searching. The text of the video is represented internally using a stan- 
dard inverted index and postings list. 
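
For readers unfamiliar with the data structure, an inverted index with postings lists can be sketched in a few lines. The story identifiers, the tokenisation, and the choice to store term frequencies in the postings are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(stories):
    """Map each term to its postings: {story_id: term frequency}."""
    index = defaultdict(dict)
    for story_id, text in stories.items():
        for term in text.lower().split():
            index[term][story_id] = index[term].get(story_id, 0) + 1
    return dict(index)


index = build_inverted_index({1: "iraq war iraq", 2: "war budget"})
```

Looking up a query term then touches only the stories that contain it, rather than scanning the text of every story.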



2.2 Searching 

The main functionality of the NewsFlash system is embodied in its adaptive search 
mechanism. Terms are entered explicitly by the searcher, or chosen implicitly by the 
system’s adaptive components and acted on by the search mechanism. The result of 
these searches is a series of ranked stories with associated textual summaries and still 
images that illustrate them. The interface presented in response to a submitted query 
is shown in Fig. 2. 

The interface contains a number of features: the top 12 stories, ranked in descending 
order of relevance to the searcher’s initial query (left to right, top to bottom), the 
video playback panel (for playing the selected video), the query reformulation panel 
and the profile search panel. A searcher views a video by clicking on the thumbnail 
for that story. When the searcher moves the mouse over the thumbnail they are shown 
the middle shot of the story and a four sentence summary. The summary is biased 
toward the initial query of the searcher and potentially consists of four query-relevant 
sentences. 

Fig. 2. NewsFlash results interface (figure labels: story thumbnail, pop-up summary) 



The system supports two main forms of search: a plain text search and a profile 
search. The plain text search removes stopwords from the searcher’s explicit textual 
query and the remaining terms are checked against the entries in the inverted index. 
The stories are scored using a best-match tf.idf weighting scheme and the top 12 
results are presented to the searcher, ranked in descending order of their relevance 
score. In the profile search, NewsFlash uses a stored profile, unique to each searcher, 
to form the query. The topics within each profile reflect the general interests of the 
searcher. The terms resident in each topic are the specifics of the general interests, 
and these terms are used to rank stories depending on the profile selected. The profile 
search therefore brings stories that are of general interest to the attention of the 
searcher. Stories on topics that are not of interest are pushed towards the bottom of the list. 
The higher ranked stories form a potentially relevant information space in which to 
begin deeper investigation. More details on this feature are given in Section 2.4. 
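
The plain text search described above (stopword removal followed by best-match tf.idf scoring and a top-12 cut-off) can be sketched as follows. The stopword list, the tokenisation, and the particular tf.idf variant (raw term frequency times the natural logarithm of N/df) are assumptions; the exact weighting used in NewsFlash is not given in the text.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "on", "and", "to"}  # illustrative only


def rank_stories(query, stories, top_k=12):
    """Return the ids of the best-matching stories, best first."""
    term_counts = {sid: Counter(text.lower().split()) for sid, text in stories.items()}
    query_terms = [t for t in query.lower().split() if t not in STOPWORDS]
    n_stories = len(stories)
    scores = {}
    for sid, counts in term_counts.items():
        score = 0.0
        for term in query_terms:
            df = sum(1 for other in term_counts.values() if term in other)
            if df:
                score += counts[term] * math.log(n_stories / df)
        if score > 0:
            scores[sid] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


stories = {
    "a": "war in iraq iraq",
    "b": "budget war",
    "c": "football results",
}
ranking = rank_stories("the war in iraq", stories)
```

A profile search could reuse the same function by passing the keywords of the selected profile topic as the query.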

NewsFlash is an adaptive search system that expands the initial query in light of 
relevance information gleaned from searcher interaction. Searchers may have prob- 
lems in adequately expressing their information needs and the terms selected by the 
system are those that NewsFlash perceives are of the highest utility for their current 
search. The next section describes how these terms are automatically selected. 

2.3 Term Selection 

When the searcher passes the mouse pointer over a thumbnail they are presented with 
a query-specific textual summary of the story that the shot describes. The searcher then 
has the option of clicking the video (to watch it play) or moving the pointer away. In 
our approach we regard the playing of a video as a strong indication of relevance. On 
every such indication we select possible expansion terms from the summary of the 
story just played and use these terms to reorder the list of stories. 

The playing implies interest in the content of the video. When the searcher makes 
the choice to play, she does so having seen the first shot of the video, the middle shot 
of the video and a four sentence summary of the content of the video subtitles, biased 
towards the searcher’s initial query. It is assumed that these three things are sufficiently 
indicative to allow the searcher to make a sound assessment of potential 
relevance. 

To rank possible expansion terms, NewsFlash uses the wpq algorithm [11] shown 
in Equation 1. For any term t, N is the total number of summaries, n_t is the 
number of summaries containing t, R is the total number of relevant summaries and r_t is 
the number of relevant summaries that contain t. R and r_t are based on relevance 
assessments and therefore tend to increase as searchers interact with the system. 
Possible expansion terms only come from the document summary generated by the 
system, and it is assumed that N is set to equal the total number of stories that are 
detected, and therefore ready to be retrieved by the system. 

$$ wpq_t = \log \frac{r_t / (R - r_t)}{(n_t - r_t) / (N - n_t - R + r_t)} \times \left( \frac{r_t}{R} - \frac{n_t - r_t}{N - R} \right) $$

Equation 1. wpq formula 
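
The wpq ranking described above can be sketched in code. The function follows the commonly cited form of the wpq weight (a relevance log-odds ratio multiplied by the difference between the proportions of relevant and non-relevant summaries containing the term). The 0.5 smoothing in the log-odds part is an assumption added here to avoid division by zero when R or r_t is small; the term statistics in the example are invented.

```python
import math


def wpq(r_t, n_t, R, N, smoothing=0.5):
    """wpq weight of a term.

    r_t: relevant summaries containing the term
    n_t: summaries containing the term
    R:   relevant summaries
    N:   all summaries
    `smoothing` is an assumed correction, not part of the definition above.
    """
    relevant_odds = (r_t + smoothing) / (R - r_t + smoothing)
    non_relevant_odds = (n_t - r_t + smoothing) / (N - n_t - R + r_t + smoothing)
    p = r_t / R if R else 0.0
    q = (n_t - r_t) / (N - R) if N > R else 0.0
    return math.log(relevant_odds / non_relevant_odds) * (p - q)


def top_expansion_terms(stats, R, N, k=6):
    """Rank terms by wpq; `stats` maps term -> (r_t, n_t)."""
    scores = {term: wpq(r_t, n_t, R, N) for term, (r_t, n_t) in stats.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical statistics after three summaries were judged relevant.
stats = {"iraq": (3, 4), "said": (3, 40), "football": (0, 5)}
best = top_expansion_terms(stats, R=3, N=50, k=2)
```

A term such as "said", which occurs in every relevant summary but also in most others, scores far lower than a term concentrated in the relevant summaries.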



All terms are ranked according to their score and the new query representation - 
the original terms plus the new expansion terms - is presented to the searcher and 
then used to reorder the list of stories. At this point it is possible for stories to leave 
the top 12 stories displayed, as the ranking of stories previously resident outside the 
top 12 is boosted by the inclusion of the query expansion terms. The expansion terms 
selected will have a high discriminatory power between relevant and non-relevant 
stories and will therefore promote those that are relevant, whilst demoting those that 
are not. The reordering uses the top six terms plus the searcher’s original query, and 
as mentioned earlier, uses a best match tf.idf weighting scheme to rank the stories. 
The searcher has the option to undo the effects of any reordering operation. The sys- 
tem reorders then offers the option to undo, rather than simply offering the option to 
reorder. The authors feel that it is better to show searchers the output of the action and 
let them decide on the value of the action, rather than let them rely on the perceived 
value of the potential action, a judgment that may be tainted by previous experience. 

We fully acknowledge the uncertainty surrounding the use of implicit evidence 
and as a result, we assign the six expansion terms a reduced weight in the new query. 
In our system, the expansion terms are given a weight half that of their original tf.idf 
weights. Equation 2 shows this, where q_1 is the expanded query, q_0 is the initial query, 
θ is the normalising factor (set depending on the reliability of the implicit evidence), 
q_e is an expansion term and n is the number of expansion terms¹. 

¹ It is worth noting that wpq is only used to rank potential expansion terms, and plays no role in 
expansion term weighting. 

$$ q_1 = q_0 + \theta \sum_{e=1}^{n} q_e $$

Equation 2. Term re-weighting 

The decision to set n = 6 was based on the authors’ previous experience with simi- 
lar systems [14] [15], where this number of expansion terms has proven sufficient. 
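
The re-weighting step can be sketched as below: the top n ranked expansion terms are added to the query at a fraction of their tf.idf weight. The dictionary representation of a query, the function name, and the example weights are assumptions; the defaults of 0.5 and 6 mirror the values stated in the text.

```python
def expand_query(original, ranked_terms, tfidf, theta=0.5, n=6):
    """Add up to `n` expansion terms at a reduced weight.

    original:     {term: weight} for the searcher's own query terms
    ranked_terms: candidate expansion terms, best first (e.g. by wpq)
    tfidf:        {term: tf.idf weight} for the candidate terms
    theta:        down-weighting factor for implicitly selected terms
    """
    candidates = [t for t in ranked_terms if t not in original][:n]
    expanded = dict(original)  # original terms keep their full weight
    for term in candidates:
        expanded[term] = theta * tfidf[term]
    return expanded


expanded = expand_query(
    original={"iraq": 2.0},
    ranked_terms=["border", "raf", "iraq"],
    tfidf={"border": 1.2, "raf": 3.0, "iraq": 2.0},
)
```

Keeping the ranking (wpq) separate from the weighting (tf.idf scaled by theta) matches the distinction drawn in the footnote to the re-weighting equation.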

The effect of the scoring is cumulative. Stories that start with a low initial ranking 
can ‘bubble up’ to the top of the list in response to a series of expanded queries that 
match the story content. All expansion term scores are reset to zero when the searcher 
enters a new query and a new result set is generated. The adaptive component therefore 
only operates in the interaction between searcher-controlled query iterations. 



2.4 Searcher Profile 

Whilst the modified query created automatically by the system during interaction is 
only temporary, and disappears when the searcher submits a new query, the system 
also provides a means of personalisation that allows a searcher to express their long- 
term general interests. NewsFlash supports the creation of a searcher profile, unique 
to each searcher. Within these profiles are general topics that can be created by the 
searcher, and within the topics are keywords, either entered explicitly, or recom- 
mended by the system using the methods described in the previous section. When the 
system reorders the list of stories, it also displays the top six expansion terms to the 
searcher, who can then decide which terms to store in their profile and which to dis- 
card. Fig. 3 shows an example of what is shown on the results interface directly 
above the list of thumbnails. In this figure the searcher had previously viewed stories 
on the latest Iraq war. 



We recommend adding the words: [ ] borders, [ ] arrive, [ ] wisdom, [ ] planes, [ ] centre, [ ] raf, to profile "war" [ Add To Profile ] 

Fig. 3. Profile term recommendation 

Unlike the story reordering described in the previous section, the searcher is given 
complete control over which terms are added to their profile. 

There are two temporal states in the system, and a term is resident in one state at 
any particular time. The states are a transient, temporary set of terms selected based 
on their estimated worth (wpq score) for each search session and a permanent, stored 
set of terms for each topic in the searcher’s profile that traverses multiple search ses- 
sions. Terms can move between states, although searchers have control over which 
terms make this move (Fig. 4). 

NewsFlash can make recommendations about terms to add to the profile based on 
videos the searcher has watched. The system only recommends potential 
terms; it does not update the profile without searcher consent. 

Fig. 4. Temporal term states in NewsFlash (figure label: transient expansion term space) 



3 Evaluation 

In this section we describe the evaluation of NewsFlash. In particular we detail the 
three versions of the NewsFlash system used for this evaluation, the participants in- 
volved, the tasks undertaken, the techniques used to evaluate the searcher profile and 
the experimental methodology employed. 

3.1 Research Hypotheses 

We endeavour to test whether (H1) the reordering of the list of stories and (H2) the 
addition of terms to a permanent profile lead to increased searcher satisfaction and 
more effective searching. 

3.2 Systems 

Three systems were used in the evaluation, each with a different level of adaptive 
functionality. As the versions of the NewsFlash system were specifically developed 
for the evaluation, the level of functionality depended directly on the research hy- 
potheses. System 1 was our experimental baseline, was not adaptive, and therefore 
did not carry out any operations automatically on the searcher’s behalf. System 2 
chose terms from summaries viewed by the searcher and used potentially useful 
non-query terms to reorder the list of stories (for H1). System 3 (effectively the complete 
NewsFlash system) reordered the list of stories, as in System 2, but also recommended 
potentially useful terms to add to the searcher’s saved profile under an appropriate 
topic heading (for H2). The topic headings were unique to each searcher. 

3.3 Subjects 

We recruited 9 subjects, who were all male, and had an average age of 21.7 years. 
The subjects were all educated to graduate level and used computers and search tools 
frequently. Subjects were also regular viewers of television news programmes. 6 out 
of the 9 subjects watched Channel 4 news (the source of the test data) on a regular 
basis. 



3.4 Tasks 

In our evaluation each subject was asked to complete three search tasks. Each search 
task was placed within a simulated work task situation [1]. This technique asserts 
that subjects should be given search scenarios that reflect real-life search situations 
and allow the searcher to make personal assessments on what constitutes relevant 
material. An example task is shown in the appendix. 



3.5 Evaluating Searcher Profiles 

System 3 allowed the searcher to save some or all of the terms recommended by the 
system for use in later queries. These terms are stored in the searcher’s profile under 
the appropriate topic headings. For example, the terms “Gulf Iraq Saddam” may be 
put under the topic heading War. To cater for contextual ambiguities, the searcher had 
the choice of which terms are retained in the profile. 

To test the effectiveness of the searcher profile (in terms of applicability in subse- 
quent search sessions) the subjects were invited to return the day after their first ses- 
sion, and attempt a second task on System 3 only. Systems 1 and 2 did not implement 
the profile updating, so there was no need to test this feature on these systems. The 
follow-up task was varied according to the subject's level of success at the original 
task (attempted on the first day). If the subject was successful, they were asked to 
attempt a task that was a continuation of the original one (i.e. they were assumed to know 
the answer to the original task); if they did not succeed, they re-attempted the original 
one. The follow-up task for the example given in Section 3.4 is shown in the appendix. 




Fig. 5. Allocation of the follow-up task (figure labels: first day, second day) 

Fig. 5 shows the allocation of the follow-up task (t1) depending on the success (S) 
or failure (F) of the original task (t0). 

3.6 Data Set 

The data set for this evaluation was the Channel 4 television news² from two consecutive 
days, the 25th and 26th of February 2003. This comprised 2 hours of footage. 
The small data set was used to test our approach and does not reflect the length of 
video that NewsFlash could adequately handle. 

3.7 Methodology 

The experiment followed a within-subjects, repeated measures design (i.e. each par- 
ticipant completed 3 tasks in total, one on each of the 3 search systems). Tasks and 
systems were allocated according to a Greco-Latin square design. To distribute 
fatigue and learning effects evenly, we rotated the order in which both tasks and systems 
were presented across participants. The negative effect of task bias was minimised by this 
rotation. Each subject was given 10 minutes to complete each task, although the 
subjects could terminate the search early if they felt they had completed the task. The 
time restriction ensured consistency between subjects. 
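
A Greco-Latin square for three tasks and three systems can be generated from two orthogonal Latin squares. The sketch below uses one standard construction for order 3; how subjects were actually assigned to rows in the study is not stated, so the `allocate` helper is an assumption.

```python
def greco_latin_square(order=3):
    """Cell (i, j) holds a (task, system) pair.

    Built from two orthogonal Latin squares, (i + j) mod order and
    (2i + j) mod order; this construction is valid for order 3.
    """
    return [
        [((i + j) % order, (2 * i + j) % order) for j in range(order)]
        for i in range(order)
    ]


def allocate(num_subjects, order=3):
    """Assign each subject one row: the sequence of (task, system) pairs
    they work through, in session order (assumed cyclic assignment)."""
    square = greco_latin_square(order)
    return {subject: square[subject % order] for subject in range(num_subjects)}


square = greco_latin_square()
schedule = allocate(9)
```

Every task and every system appears once in each row and each position, and every (task, system) combination appears exactly once in the square, which is what balances order effects across participants.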

The subjects were welcomed and given a short tutorial on the features of the three 
systems being tested. We also collected background data on aspects such as the sub- 
jects' experience and training in online searching. After this, subjects were introduced 
to tasks and systems according to the experimental design. Subjects were instructed to 
attempt the task to the best of their ability and write their answer on a sheet provided. 
A search was seen to be successful if the searcher felt they had succeeded in their 
performance of the task. 

When they completed a search, the subjects were asked to complete questionnaires 
regarding various aspects of the search. We used semantic differentials, Likert scales 
and open-ended questions to collect this data. 

4 Results and Analysis 

In this section we present preliminary results from our system evaluation. In particu- 
lar, we focus on results pertinent to each of our two research questions: the worth of 
the adaptive document reordering and the worth of the searcher profile. Tests for 
statistical significance will be given where appropriate with p < .05, unless otherwise 
stated. S1, S2 and S3 denote System 1, System 2 and System 3 respectively. We 
analyse aspects of the results individually. M is used to denote the mean, and 5-point 
scales are used throughout, with a value of 1 reflecting more agreement. 

4.1 Task Success 

Task success was measured from the perspective of the searcher and therefore 
searchers decided whether they had completed a task. This was felt to be representative 
of most real-life search situations³ and fitted well with the use of simulated work 
task situations. The differences in levels of task success between the systems were 
marked. When using the experimental baseline (System 1), searchers completed 80% 
of their tasks, with 67% on System 2 (reordering only) and 53% on System 3 (complete 
NewsFlash). None of the differences between systems is significant with paired 
t-tests (t_6); the significance values are shown in Table 1. 

² Channel 4 is a UK terrestrial television channel. 



Table 1. Inter-system comparison of task completion rate 

Comparison    Task completion rate    T-value    Significance 
S1 vs. S2     80% vs. 67%             .57        .426 
S1 vs. S3     80% vs. 53%             1.34       .130 
S2 vs. S3     67% vs. 53%             .53        .473 



4.2 Reordering of Stories 

The stories were reordered every time the system received a positive relevance 
indication (i.e. playing of a video). This reordering shaped the result set to reflect the 
perceived current information needs of the searcher. To test whether the reordering 
was effective, subjects were asked to indicate on a Likert scale how helpful this 
operation was. The responses differed significantly from the median 
value (i.e. 3; M = 2.286) using a paired t-test (t_6 = 2.38, p = .032). Subjects felt that 
the reordering of the stories by the system on their behalf helped, rather than hindered 
their seeking. 



4.3 Searcher Profiles 

NewsFlash maintained a stored profile of searchers’ general interests over a period of 
time. To test this, searchers participated over two days. The task they attempted on 
the second day was dependent on their level of success with the task on the first day. 
In this regard, participants were asked to rate, on a Likert scale, the worth of being 
able to store terms during the original task and reuse them for the follow-up task. The 
results significantly differed from the median (M = 1.714, t_6 = .98, p = .176). This 
perhaps shows that the searcher profiles can indeed be useful, assuming the follow-up 
tasks are in some way related to the original task. Through dividing the profile into a 
number of topics, and allowing searchers to place terms within these topics we cater 
for the potential diversity of long-term needs and intentions. Ultimately the searcher 
has control over which topic is used in NewsFlash at a certain time, so the terms cho- 
sen for the profile search will relate to their current general interests. The system will 
suggest specific additions to the profile based on searcher interaction. 



3 See Reid [10] for a counterexample.

4.4 Worth of Expansion Terms

When selecting terms on behalf of the searcher it is important that these terms are of
use in their search. To test this aspect of the NewsFlash system, participants were
asked to rate (using a semantic differential) whether the terms added to their original
query were useful always, occasionally or never. The responses were mixed (M = 3)
and not significantly different from the median value with a paired t-test (T = .65,
p = .375). The results seem promising if we allow for the uncertainty inherent in implicit
techniques of this nature and the relative simplicity of the methodology used to select
expansion terms.

5 Discussion 

Selecting worthwhile terms on behalf of searchers relies on an ability to predict their 
information needs to a very fine level of granularity. The approach used in News- 
Flash to detect searcher needs is coarse-grained and as such is likely to produce a 
certain number of erroneous terms (i.e. not all terms in a summary will be relevant). 
We acknowledge this uncertainty in Equation 2, assigning a reduced weight to the 
terms selected implicitly. It is envisaged that as the system’s ability to perceive in- 
formation needs improves through modifications to our approach, the normalisation 
weight 9 will tend to one, and the terms selected by the system will play a more im- 
portant role in the new query. 

As is evident in the previous section, many of the differences in the results presented
were not statistically significant. This is perhaps due to the small number of
experimental participants in this preliminary study. The most disconcerting result is
that whilst subjects liked the query expansion, it appears to have reduced their
perception of task success. This could be for a number of reasons; perhaps the most
likely is the unfamiliarity of NewsFlash’s adaptive features. This is supported by
increased levels of task success on the follow-up task, over the original task. Despite
the pre-experiment tutorial, the original task was the first time searchers had an
opportunity to attempt a ‘real’ task in a time-constrained context. We posit that as
searchers become more familiar with the adaptive features they will become more
effective in their seeking.

The reordering of the stories and the recommendation of terms for addition to the
permanent searcher profile were well received by subjects. This is a promising result:
the experimental baseline was strict (i.e. it used the same search interface as the
adaptive systems), so any responses that suggested performance gains or increased
levels of searcher satisfaction are indeed worthwhile.

During the experiments, we also logged task completion time and number of query 
iterations for each of the three systems. There were no significant differences between 
any of the systems for each of these measures. Finding that the adaptive system per- 
formed just as well as the baseline showed that it did not hinder searchers, but it did 
not help them sufficiently (in this regard at least) to reduce task completion times by a 
significant level.

Due to the small sample size, it was thought that the most interesting results may
come from the informal comments of searchers as they searched. These comments
proved promising: searchers liked the complete NewsFlash system substantially more
than the other systems. The negative comments concerned small interface issues,
although one participant did raise what was thought to be a particularly valid concern.
He suggested that NewsFlash “could do with more descriptions of what was happening”.
This is an important point: if adaptive systems are going to work on behalf of
the searcher, it seems reasonable that they explain their actions rather than detaching
themselves completely. At present, adaptive search systems assume a ‘black box’
approach to assisting those they are meant to help. Explanations may open this box,
help engender trust in NewsFlash’s actions and perhaps bridge the gap between
searcher and system.



6 Conclusions 

In this paper we have presented NewsFlash, an adaptive querying system for search- 
ing online repositories of stored TV news footage. Unlike similar systems developed 
to search such corpora, our approach implicitly refines and enhances searchers’ initial 
queries based on their interaction. The system adapts to the needs of searchers with- 
out requiring the searcher to explicitly define these needs or changes in them. We 
proposed a means of extracting and weighting terms for implicit query expansion 
during the interaction with the results of a multimedia search system. We use this
query for two operations: reordering the list of stories and suggesting terms to be
added to a searcher’s unique profile.

We carried out a preliminary evaluation of the system with real searchers and real- 
istic information seeking scenarios and adopted a novel means of testing the worth of 
stored searcher profiles. The results, whilst not being statistically significant, ap- 
peared promising. Searchers felt that terms selected by the system were useful and 
storing terms in the searcher profile was helpful for future searches. The results also 
indicated adaptive interface support may be of most use to novice searchers who 
typically have problems conceptualising their information needs. 

Systems that are able to adapt to the needs of searchers can remove the cognitive 
burden and expense of term selection and query reformulation and offer them addi- 
tional support in their seeking. Systems of this nature have the potential to improve 
the search experience for struggling searchers everywhere. Future versions of the 
NewsFlash system will focus on better identifying information needs and explaining 
its actions to those who use it.

Appendix

Original Task

You are considering travelling abroad, you have however been getting strange letters
through the post, informing you that you are defaulting on credit card accounts you
don’t have. You are concerned that someone may have stolen your identity.

Try to find if there are any similar cases in the news today. 

Follow-up Task 

Derek Bond’s family were interviewed after his release. What was their opinion on 
the actions of the British government in this case?

Abstract. Methods for the automatic categorization of documents are usually
based on a simple analysis of the considered document collection. User specific
criteria, e.g. interests in specific topics or keywords, are usually neglected.
Therefore, the resulting categorization frequently does not fulfil the user's
expectations. In prior work we developed an approach to cluster document
collections by growing self-organizing maps that adapt their structure
automatically to the structure and size of the underlying document collection. In this
paper, we present an approach to improve the obtained clustering by considering
user feedback (in the form of drag-and-drop) to adapt the underlying topology
and thus the categorization of documents by the self-organizing map.
Furthermore, we briefly present applications for image and text document
collections.



1 Introduction 

Today large archives of text, images, audio and/or video sequences are available. One
main problem in accessing these archives is that the objects are very often not classified,
or not appropriately classified, e.g. by keywords. Therefore, the retrieval of specific objects
is usually very expensive or even impossible. Meanwhile, a number of indexing
methods have been proposed that extract characteristics (features) of multimedia
objects automatically. Almost all of these methods are based on a numerical data space,
i.e. an index is computed, which is a numerical feature vector that describes the
objects. For instance, text documents are frequently indexed by selected key terms and
then each term is represented by a number in a dictionary vector [14]. Other examples
are images that can be described by color or texture histograms that are also
represented as numerical vectors [2, 9].

In prior work, we have implemented a document retrieval system that provides, besides
standard keyword search techniques, a growing self-organizing map to visualize
the underlying data collection [10, 11]. These maps are built based on the provided
collection, taking into account not only neighboring documents, but also introducing
new words to the index if they are considered relevant. All of this depends
completely on the learning context (i.e. the database used for training). Thus, the system
automatically enables a user to visualize, search and navigate in arbitrary document
collections that can be represented by numerical vectors.

Unfortunately, the existing methods are not aware of either the search context or
the user preferences. Therefore, we present in the following an extension of our system
that makes it aware of user-feedback in order to adapt the classification of documents.
We first give a brief introduction to self-organizing systems and
the document retrieval system used. Then, we outline two complementary learning
methods based on user-feedback and a generalized version of these methods. We
conclude with a brief discussion of interactive retrieval systems for image and text
collections, in which the discussed algorithms have been implemented.

2 Self-organizing Systems 

Self-organizing maps [5] are a special architecture of neural networks that clusters 
high-dimensional data vectors according to a similarity measure. The clusters are 
arranged in a low-dimensional topology that preserves the neighborhood relations in 
the high dimensional data space. Thus, not only objects that are assigned to one clus- 
ter are similar to each other (as in every cluster analysis), but also objects of nearby 
clusters are expected to be more similar than objects of distant clusters. Usually, two- 
dimensional grids of squares or hexagons are used. Although other topologies are 
possible, two-dimensional maps have the advantage of an intuitive visualization and 
thus good exploration possibilities. 

Several applications have been proposed for the usage of self-organizing maps for
classification and exploration of collections of text documents, images or speech data
[1, 7, 8]. Our approach for document retrieval [4] combines conventional keyword
search methods with several SOM-based views of the document collection, to allow
interactive exploration.

In this paper, we will focus on different extensions of the basic algorithm in order
to allow the integration of user-feedback. Other previous extensions of the model
include the implementation of growing self-organizing maps, which eliminates the
necessity to define the map size manually, leading to more appropriate mappings of
the objects [10, 11]. In [12] we discussed how the system could be applied to
collections of objects other than text documents.

2.1 Using the Maps 

The SOM algorithm can be applied to arbitrary document collections, provided that a
vector description of the considered objects is given. However, it is essential that the
vector consists of object features that represent the characteristics of the objects
appropriately and that the vector similarity used reflects a real similarity between the
objects. More on the construction of the maps, for instance on the training algorithm,
can be found in [6]. Here we focus on the proposed extensions.

A trained map represents a clustering of the object collection. A browsable list of 
objects can be assigned to every grid cell. A screenshot of our implementation for text 
document retrieval is shown in Fig. 1. Based on this architecture two main querying 
possibilities are offered: by keyword and by example (or content). 

2.2 Query by Keyword 

For a collection of text documents - or for a multimedia database that provides be- 
sides numerical features textual information about the objects - a keyword search can 
be performed resulting in a list of objects that are ranked according to their similarity 
to the given keywords.

Fig. 1. A text document retrieval system based on the proposed approach [4].

Since a user might also be interested in objects dealing with related keywords, the
document map - a self-organizing map that is trained by the textual features of the
objects - can additionally be used. This map provides a visualization of the search
results. To this end, the search keywords can be associated with colors. The nodes of
the document map are highlighted with blends of these colors to indicate how well the
documents assigned to a node match the search terms. This feature enables the user to
see how widely the results of his query are spread and thus can give him an idea of
whether he should further refine the search. For example, if the highlighted nodes form
clusters on the map, we can suppose that the corresponding search term was relevant for the
neighborhood relations in the learning of the self-organizing map. In this case the
probability of finding documents with similar topics in adjacent nodes can be expected to
be higher.

Furthermore, if the user selects a document in a result list, the node in the map to 
which this document is assigned is marked and the user can search for similar docu- 
ments in the surrounding area. The labels of the nodes, which classify the documents 
that are assigned to the specific nodes, can be used as hints for navigation. 



2.3 Query by Example 

If the user is looking for objects similar to a given sample object (associative search), 
this sample object can be directly mapped to the document map. The grid cell that is 
selected as winner unit refers to the objects that are most similar to the provided sam- 
ple. The user can also use this winning unit as a starting point for navigating to similar 
objects in its neighborhood.

To improve the visualization, the map can be colored with respect to the distance
to surrounding grid cells [3]. In this case, e.g., darker colors visualize great differences
to the neighboring cell prototypes while lighter colors represent homogeneous regions.
Another method for coloring is to visualize the distance between every cell and the
sample object provided [7]. Both coloring approaches give the user hints about the
structure of the underlying database.
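
As an illustration, the query-by-example lookup and the two coloring schemes described in this section might be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the tool itself: the plain Euclidean distance, the square grid with 4-connected neighbors, the gray-level scaling and all function names are hypothetical choices made for the example.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def winner_unit(prototypes, sample):
    """Grid cell whose prototype is closest to the sample object."""
    return min(prototypes, key=lambda cell: euclidean(prototypes[cell], sample))

def distance_to_sample(prototypes, sample):
    """Coloring by the distance between every cell and the sample object."""
    return {cell: euclidean(proto, sample) for cell, proto in prototypes.items()}

def distance_to_neighbours(prototypes):
    """Coloring by the mean distance to the 4-connected neighbour cells (assumed grid)."""
    result = {}
    for (r, c), proto in prototypes.items():
        dists = [
            euclidean(proto, prototypes[n])
            for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if n in prototypes
        ]
        result[(r, c)] = sum(dists) / len(dists) if dists else 0.0
    return result

def to_gray(values):
    """Scale values to gray levels 0..255; a larger distance gives a darker cell."""
    hi = max(values.values())
    if hi == 0:
        return {cell: 255 for cell in values}
    return {cell: round(255 * (1 - v / hi)) for cell, v in values.items()}

# Toy 1x3 map with two-dimensional prototypes (illustrative values only).
toy_map = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 0.0], (0, 2): [3.0, 4.0]}
start = winner_unit(toy_map, [2.9, 4.1])          # starting point for navigation
shades = to_gray(distance_to_neighbours(toy_map))  # homogeneous cells stay light
```

In this toy map the two identical prototypes form a homogeneous (light) region, while the cell that differs strongly from its neighbor is rendered dark.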



3 Incorporating User Feedback 

To allow the user to give feedback information, the tool was extended such that a user 
can drag one or several objects from one node of the map to another, which in his 
opinion is more appropriate for the considered objects. Furthermore, the user can 
mark objects that should remain at a specific node, thus preventing the algorithm from 
moving them together with the moved object during the re-computation of the docu- 
ment allocations. 

In the following, we describe two user-feedback models that have been designed to 
solve specific clustering problems [13]. These two approaches modify the underlying 
similarity measure by increasing or decreasing the importance of individual features. 



3.1 Learning Global Feature Weighting 



For the implementation of a global feature weighting scheme, we replaced the similarity
function used for the computation of the winner nodes by a weighted similarity
measure. Therefore, in case of the Euclidean distance, the distance of a given feature
vector to the feature vectors of the prototypes is computed by

$$\sqrt{\sum_i w^i \cdot (x_s^i - y_k^i)^2} \qquad (1)$$

where $w$ is a weight vector, $y_k$ the feature vector of a document $k$ and $x_s$ the
prototypical feature vector assigned to a node $s$. For the scalar product, which is usually
used to compute the similarity between text documents, we obtain

$$\frac{x'_s \cdot y'_k}{\|x'_s\| \cdot \|y'_k\|} \quad \text{with } x'^{\,i}_s = w^i \cdot x_s^i, \ y'^{\,i}_k = w^i \cdot y_k^i \ \ \forall i. \qquad (2)$$
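
The two weighted measures can be written down directly. The following Python sketch reads Equation (1) as a weighted Euclidean distance (including the square root) and Equation (2) as the cosine of the element-wise reweighted vectors; the function names and the handling of all-zero vectors are assumptions made for the example.

```python
import math

def weighted_euclidean(w, x_s, y_k):
    """Equation (1) as read here: sqrt(sum_i w_i * (x_s_i - y_k_i)^2)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x_s, y_k)))

def weighted_cosine(w, x_s, y_k):
    """Equation (2) as read here: cosine of the reweighted vectors w*x_s and w*y_k."""
    xw = [wi * xi for wi, xi in zip(w, x_s)]
    yw = [wi * yi for wi, yi in zip(w, y_k)]
    norm = math.sqrt(sum(v * v for v in xw)) * math.sqrt(sum(v * v for v in yw))
    if norm == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zero
    return sum(a * b for a, b in zip(xw, yw)) / norm
```

With all weights equal to one both functions reduce to the unweighted measures; setting a weight to zero removes the corresponding feature from the comparison.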

We update the global weight vector w based on the differences between the feature
vector of the moved document and the prototype vectors of the origin node and of the
target node. The goal is to increase the weights of features that are similar between the
document and the target node and to decrease the weights of features that are similar
between the document and its current node, and, symmetrically, to decrease and increase
the weights of the dissimilar features.

Let $y_i$ be the feature vector of a document $i$, $s$ be the source and $t$ the target node,
$x_s$ and $x_t$ be the corresponding prototypes, then $w$ is computed as described in the
following. First we compute an error vector $e$ for each object based on the distance to the
prototypes

$$e_{ji}^k = |d_{ji}^k| \ \ \forall k, \quad \text{where } d_{ji} = \frac{y_i - x_j}{\|y_i - x_j\|}. \qquad (3)$$

If we want to ensure that an object is moved from the source node to the target
node using feature weights, we have to assign higher weights to features that are more
similar to the target than to the source node. Thus for each object we compute the
difference of the distance vectors

$$f_i = e_{si} - e_{ti}. \qquad (4)$$

The global weight vector is finally computed iteratively. For the initial weight vector
we choose $w^{(0)} = w_1$, where $w_1$ is a vector whose elements are all equal to one. Then
we compute a new global weight vector $w^{(t+1)}$ by an element-wise multiplication:

$$w^{k(t+1)} = w^{k(t)} \cdot w_i^k \ \ \forall k, \quad \text{with } w_i = (w_1 + \eta \cdot f_i), \qquad (5)$$

where $\eta$ is a learning rate. The global weight is modified until - if possible - all
moved objects are finally mapped to the target node. A pseudocode description of this
approach is given in Fig. 2.

Obviously, this weighting approach also affects the assignments of all other docu- 
ments. The idea is to interactively find a feature weighting scheme that improves the 
overall classification performance of the map. Without a feature weighting approach 
the map considers all features equally important. 

Notice that the global weights reflect the user preferences and therefore can be
used to identify features that the user considers important or less important. If, for
example, text documents are used where the features represent terms, then we might
get some information about the keywords that the user seems to consider important
for the classification of the documents.

    Compute the weight vectors w_i;
    If the global weight vector w is undefined
        create w;
        initialize all elements to one;
    end if;
    cnt = 0;
    Repeat until all documents are moved or cnt > max
        cnt++;
        For all documents i to be moved do
            Compute the winning node n for i;
            if n ≠ t_i (target for i) then
                w^k := w^k · w_i^k, ∀k;
                normalize w;
            end if;
        end for;
    end repeat;

Fig. 2. Pseudocode description of the computation of a global weight.
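
The procedure of Fig. 2 could be turned into runnable code roughly as follows. This Python sketch rests on several assumptions that the text does not fix: the winner is found with the weighted Euclidean distance, the error vector of Equation (3) is read as the element-wise absolute value of the normalized difference, and "normalize w" is taken to mean rescaling so that the weights sum to the number of features (the normalization stated for the local scheme). All names are hypothetical.

```python
import math

def error_vector(y, x):
    """Element-wise absolute value of the normalised difference between y and x."""
    diff = [yi - xi for yi, xi in zip(y, x)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [abs(d) / norm if norm else 0.0 for d in diff]

def weighted_distance(w, x, y):
    """Weighted Euclidean distance used for the winner search."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def winner(prototypes, w, y):
    """Node whose prototype is closest to y under the weight vector w."""
    return min(prototypes, key=lambda node: weighted_distance(w, prototypes[node], y))

def learn_global_weights(prototypes, moves, eta=0.1, max_iter=100):
    """moves: list of (document_vector, source_node, target_node) triples."""
    n_features = len(next(iter(prototypes.values())))
    w = [1.0] * n_features  # neutral initial weights
    # Per-document factors w_i = 1 + eta * (e_si - e_ti), cf. Equations (4) and (5).
    factors = []
    for y, s, t in moves:
        e_s = error_vector(y, prototypes[s])
        e_t = error_vector(y, prototypes[t])
        factors.append([1.0 + eta * (a - b) for a, b in zip(e_s, e_t)])
    for _ in range(max_iter):
        all_moved = True
        for (y, _source, t), w_i in zip(moves, factors):
            if winner(prototypes, w, y) != t:
                all_moved = False
                w = [wk * fk for wk, fk in zip(w, w_i)]
                total = sum(w)
                w = [wk * n_features / total for wk in w]  # assumed normalisation
        if all_moved:
            break
    return w

# Toy example: the document is closer to the source node until feature 0 gains weight.
toy_prototypes = {"s": [0.6, 0.8], "t": [1.0, 0.3]}
toy_document = [1.0, 0.8]
learned = learn_global_weights(toy_prototypes, [(toy_document, "s", "t")])
```

In the toy example the weight of the feature on which the document disagrees with the source prototype grows, and the weight of the feature on which it disagrees with the target prototype shrinks, until the document is mapped to the target node.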



3.2 Learning a Local Weighting Scheme 

The global weighting scheme emphasizes general characteristics, which support a
good overall grouping of the data collection. Unfortunately, this may lead to large
groups of cells with quite similar documents. In this case some features - which are of
less importance on a global scope - might be useful for distinguishing between local
characteristics. Thus, locally modifying the weights assigned to these features might
improve the assignment of the documents to more specific local classes.

The proposed learning method is quite similar to the method described above.
However, instead of modifying the global weight $w$, we modify local weights
assigned to the source and the target nodes (denoted here $w_s$ and $w_t$).

As before, we first compute an error vector $e$ for each document based on the
distance to the prototypes, as defined in equation (3).

Then we set all elements of the weight vectors $w_s$ and $w_t$ to one and compute local
document weights $w_{si}$ and $w_{ti}$ by adding (subtracting) the error terms to (from) the
neutral weighting scheme $w_1$. Then we compute the local weights iteratively, similar to the
global weighting approach:

$$w_s^{k(t+1)} = w_s^{k(t)} \cdot w_{si}^k \ \ \forall k, \quad \text{with } w_{si} = w_1 + \eta \cdot e_{si} \qquad (6)$$

and

$$w_t^{k(t+1)} = w_t^{k(t)} \cdot w_{ti}^k \ \ \forall k, \quad \text{with } w_{ti} = w_1 - \eta \cdot e_{ti}, \qquad (7)$$

where $\eta$ is a learning rate. The weights assigned to the target and source node are
finally normalized such that the sum over all elements equals the number of features
in the vector, i.e.

$$\sum_k w_s^k = \sum_k w_t^k = \sum_k 1. \qquad (8)$$

In this way the weights assigned to features that achieved a higher (lower) error are
decreased (increased) for the target node and vice versa for the source node.
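
One step of the local scheme, together with a winner search in which every node uses its own weight vector, might look as follows in Python. The sketch assumes the weighted Euclidean distance for the winner search and applies the normalization after every step; the function names and the toy values are hypothetical.

```python
import math

def weighted_distance(w, x, y):
    """Weighted Euclidean distance between prototype x and document y."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def normalise_to_feature_count(w):
    """Rescale w so that its elements sum to the number of features."""
    total = sum(w)
    return [wk * len(w) / total for wk in w]

def local_update(w_s, w_t, e_s, e_t, eta=0.1):
    """One multiplicative step for the source and target node, then normalisation."""
    w_s = [wk * (1.0 + eta * ek) for wk, ek in zip(w_s, e_s)]  # source: 1 + eta * e_si
    w_t = [wk * (1.0 - eta * ek) for wk, ek in zip(w_t, e_t)]  # target: 1 - eta * e_ti
    return normalise_to_feature_count(w_s), normalise_to_feature_count(w_t)

def winner_local(prototypes, weights, y):
    """Winner search in which every node uses its own weight vector."""
    return min(prototypes, key=lambda n: weighted_distance(weights[n], prototypes[n], y))
```

Features with a large error towards the source prototype gain weight at the source node, which pushes the document away from it, while features with a large error towards the target prototype lose weight at the target node, which pulls the document towards it.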



3.3 A Generalized Learning Model 

With the local approach we just modified weighting vectors of the source and target 
nodes. However, as adjacent map nodes should ideally contain similar documents, 
one could demand that the weights should not change abruptly between nodes. Thus, 
it is a natural extension of this approach to modify the weight vectors of the neighbor- 
ing map units accordingly with a similar mechanism as in the learning of the map. In 
the following, we present such an extension. 

As for the local approach, we have one weighting vector per node. Then - as before -
we start by computing an error vector $e$ for each object based on the distance to the
prototypes, as defined in equation (3).

Based on the error vectors $e$, the weight vectors of each node $n$ are computed
iteratively. For the initial weight vectors $w_n^{(0)}$ we choose vectors where all elements are
equal to one. We then compute a new local weight vector for each node by an
element-wise multiplication:

$$w_n^{k(t+1)} = w_n^{k(t)} \cdot w_{ni}^k \ \ \forall k, \quad \text{with } w_{ni} = w_1 + \eta \cdot (g^r_{sn} \cdot e_{si} - g^r_{tn} \cdot e_{ti}), \qquad (9)$$

where $\eta$ is a learning rate and where $g^r_{sn}$ and $g^r_{tn}$ are weighting values calculated
using a neighborhood function. Depending on the radius $r$ of the neighborhood function,
the resulting weighting scheme lies between the local approach ($r = 0$) and the
global approach ($r = \infty$, in which case all weights are equal) discussed above.

Because two quite similar prototypes could be projected onto slightly distant cells of
the map, e.g. in very dense areas, the neighborhood function should be based on the
actual topology of the map. Here we propose to use a linearly decreasing function for
$g^r_{sn}$, which equals one for the source node and equals zero at the hull defined by the
radius $r$. The same holds for the target node and $g^r_{tn}$ (see also Fig. 3). Notice that
more refined functions can be used, for instance Gaussian-like functions.

As above, all weight vectors are modified until - if possible - all moved objects
are finally mapped to the target node.
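
A single update step of the generalized model with a linearly decreasing neighborhood function might be sketched in Python as follows. The grid distance is taken here as the Chebyshev distance between cell coordinates on a square grid, which is only one possible reading of "the actual topology of the map"; the function names are hypothetical and no normalization is applied in this sketch.

```python
def linear_neighborhood(center, node, radius):
    """1 at the centre node, falling linearly to 0 at grid distance `radius`.

    Grid distance is the Chebyshev distance between cell coordinates (assumed).
    With radius 0 only the centre node itself receives the value 1.
    """
    dist = max(abs(center[0] - node[0]), abs(center[1] - node[1]))
    if dist == 0:
        return 1.0
    if radius <= 0 or dist >= radius:
        return 0.0
    return 1.0 - dist / radius

def generalized_update(weights, source, target, e_s, e_t, radius, eta=0.1):
    """One step for every node n: w_n *= 1 + eta * (g_sn * e_s - g_tn * e_t)."""
    updated = {}
    for node, w_n in weights.items():
        g_s = linear_neighborhood(source, node, radius)
        g_t = linear_neighborhood(target, node, radius)
        updated[node] = [
            wk * (1.0 + eta * (g_s * es - g_t * et))
            for wk, es, et in zip(w_n, e_s, e_t)
        ]
    return updated
```

With `radius=0` only the source and target nodes are changed, which corresponds to the local approach; larger radii spread a damped version of the same change over the surrounding nodes.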



Fig. 3. Neighborhood function centered on target node (decreasing to zero for r). 

This weighting approach affects the assignments of documents of neighboring 
cells. The influence of the modification is controlled by the neighborhood function. 
The idea is that a local modification has a more global repercussion on the map. In 
this way we can interactively find a feature weighting scheme that improves the clas- 
sification performance of the map. 



4 Application Examples 

The discussed learning methods were implemented in two prototypical implementa- 
tions: One interactive tool for image retrieval and one for text retrieval. In the follow- 
ing we briefly discuss two applications of the respective models. 

4.1 An Image Retrieval System 

As a sample data set we used a picture database provided by the California Department
of Water Resources. The database contains more than 17,000 color
images of scenery, buildings, animals, people, etc. They are available from the
web server of the Digital Library Project, University of California, Berkeley
(http://elib.cs.berkeley.edu/). The pictures are provided as JPEG images; each picture
has a size of 192x128 pixels.

For the evaluation of the tool we selected a small subset of 186 images and computed
303-dimensional feature vectors for each picture based on histograms computed
on the pixel colors in the HSV color model. Then we trained the growing self-organizing
map starting with 4x4 neurons. A resulting map and the learning environment
are shown in Fig. 4.
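
To make the idea of histogram features in the HSV color model concrete, the following Python sketch concatenates normalized hue, saturation and value histograms of an image's pixels. The bin counts (8, 4, 4) and the per-channel layout are assumptions chosen for brevity; they do not reproduce the 303-dimensional vectors used in the evaluation, whose exact construction is not specified here.

```python
import colorsys

def hsv_histogram(pixels, bins=(8, 4, 4)):
    """Concatenated, normalised H, S and V histograms of (r, g, b) pixels in 0..255."""
    hists = [[0] * b for b in bins]
    count = 0
    for r, g, b in pixels:
        hsv = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        for channel, value in enumerate(hsv):
            # Clamp so that a value of exactly 1.0 falls into the last bin.
            index = min(int(value * bins[channel]), bins[channel] - 1)
            hists[channel][index] += 1
        count += 1
    if count == 0:
        return [0.0] * sum(bins)
    return [n / count for hist in hists for n in hist]
```

Each of the three histograms sums to one, so images of different sizes yield comparable feature vectors.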

After learning of the image map, we obtained at the neighboring nodes (8,6) and
(9,6) the image groups shown in Fig. 5. Unfortunately, there are still very similar
images distributed over both groups. To improve the grouping we moved the image
#98 showing a tank from node (8,6) to node (9,6). Then we retrained the system using
all three discussed user-feedback approaches.

In Fig. 6 the obtained image groups after retraining are shown. As we can see, the
local weighting approach is more selective and only moves image #98 to the new
node. The global weighting approach adds images to both groups; in particular, a
similar tank image (#165) was also added to the target node (9,6).

The color histograms of the moved image (#98) and the prototypes assigned to the
source node (8,6) and the target node (9,6) are shown in Fig. 7. Furthermore, the
learned weighting vectors are shown. As expected, the global weighting vector increases
the weights of features where the error between the image feature and the source prototype
(8,6) is larger than between the image feature and the target prototype. Also, the local
weighting vectors enhance (dampen) errors between the image and the source (target)
prototype and vice versa.

If we continue to move several pictures between groups, we can see that the approach
is able to reflect the desired changes more specifically than by moving just one
picture. However, since the image features used rely only on color information to
describe the images, we cannot expect to obtain every grouping result we would like to
have (for instance a semantic classification that does not correspond to a color reality).




Fig. 4. The learning environment: Resulting map and some grouped images. The map is
colored according to the similarity between a selected image and the prototypes assigned
to the map nodes.

Fig. 5. Image groups discovered by the learning process of the growing self-organizing map.

Fig. 6. Image groups after moving image #98 using the global weighting approach (left) and
using the local weighting approach (right).



4.2 A Text Retrieval Prototype 

Based on the implementation discussed in [10], a text retrieval prototype was
implemented that integrates the user-feedback methods discussed above. We applied
the approach to a text database consisting of abstracts from the conference of the
European Geophysical Society (EGS 2000; see also [4]) and emails of a mailing list
on fuzzy systems (listproc@dbai.tuwien.ac.at). From each collection 1000 documents
were randomly selected and used for training a growing self-organizing map.

In order to give an idea of the effect of user feedback on the weight vector and thus
the overall representation of the document collection, we depict in the following the
feature vectors of the group prototypes, an object and the global weight vector after
the object has been moved between the groups (see Fig. 8). As can be seen, the algorithm
increased the weights of word stems that are used in the document and the target
node (flux, gravity, materi) and reduced the weights for terms that are used in the
source node (billion, debri, rough, sand), while the strength of the weight changes
depends on the importance of the respective words in the corresponding group prototypes
and the object itself.

Fig. 7. Top: Features (histograms) for prototypes of node (8,6), (9,6) and the moved image
(#98) - Bottom: Learned weighting vectors.

Fig. 8. Visualization of the modifications on the weight vector for selected features (words):
Document 181 'Evolution of the surface of mars in the light of mars' moved from node (11,12)
to node (11,11). The word stems used in the document are labeled.

Thus, besides its use in the system itself in order to obtain a user specific mapping,
the weight vector - and thus the user specific similarity measure - can be interpreted
in order to get some insight into the user interests. Furthermore, the obtained weighting
scheme can directly be used in order to rank search results with respect to user
specific interests.
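
Both uses of the learned weight vector can be illustrated with a short Python sketch: ranking documents by a weighted scalar-product similarity and listing the terms whose weights lie above the neutral value of one. The weighted cosine measure, the function names and the threshold are assumptions made for the example, and any weights or vectors passed in are illustrative rather than taken from the experiments.

```python
import math

def weighted_cosine(w, x, y):
    """Cosine similarity after scaling both vectors element-wise by w."""
    xw = [wi * xi for wi, xi in zip(w, x)]
    yw = [wi * yi for wi, yi in zip(w, y)]
    norm = math.sqrt(sum(v * v for v in xw)) * math.sqrt(sum(v * v for v in yw))
    return sum(a * b for a, b in zip(xw, yw)) / norm if norm else 0.0

def rank_results(query, documents, w):
    """Document ids ordered by decreasing weighted similarity to the query vector."""
    return sorted(
        documents,
        key=lambda doc_id: weighted_cosine(w, query, documents[doc_id]),
        reverse=True,
    )

def emphasised_terms(w, vocabulary, threshold=1.0):
    """Terms whose learned weight lies above the neutral value, strongest first."""
    pairs = [(term, wk) for term, wk in zip(vocabulary, w) if wk > threshold]
    return [term for term, _ in sorted(pairs, key=lambda pair: -pair[1])]
```

The same query can therefore produce different result orders for users whose feedback led to different weight vectors.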



5 Conclusions 

Self-organizing maps provide valuable means for visualization and exploration of any 
object collection that can be described by numerical feature vectors. The combination 
with traditional search and coloring methods allows the design of interactive and user- 
friendly object retrieval tools. 

The presented learning approaches incorporate user-feedback and thus allow the
map, which was initially trained in an unsupervised manner, to be refined with respect to
user specific interests. The initial clustering that only depends on the definition of
describing features and the similarity measures is adjusted by gathered information
about user specific grouping criteria. Thus, a ‘user desired’ similarity measure is
obtained by modification of the importance of certain features using weights. As shown
above, this weighting scheme can also be interpreted in order to obtain information
about user interests. Furthermore, the weighting scheme can directly be used in ranking
methods in order to obtain result lists that are ordered with respect to user specific
interests.

Using the discussed learning approaches, an interactive retrieval tool could be
developed that adapts its visualization, clustering and ranking to user specific needs.
Since the obtained output reflects the expectations of a user, these techniques can
increase the user acceptance as well as the retrieval performance.



References 

1. Allinson, N., Yin, H., Allinson, L., and Slack, J. (eds.), Advances in Self-Organizing Maps,
Proc. of the Third Workshop on Self-Organizing Maps (WSOM 2001), Springer-Verlag,
Berlin, 2001.

2. Gudivada, V., and Raghavan, J. V., Special issue on content-based image retrieval systems,
IEEE Computer Mag. 28(9), IEEE, 1995.

3. Honkela, T., Kaski, S., Lagus, K., and Kohonen, T., Newsgroup Exploration with the
WEBSOM Method and Browsing Interface, Technical Report, Helsinki University of
Technology, Neural Networks Research Center, Espoo, Finland, 1996.

4. Klose, A., Nürnberger, A., Kruse, R., Hartmann, G. K., and Richards, M., Interactive Text
Retrieval Based on Document Similarities, Physics and Chemistry of the Earth, Part A:
Solid Earth and Geodesy, 25(8), pp. 649-654, Elsevier Science, Amsterdam, 2000.

5. Kohonen, T., Self-Organized Formation of Topologically Correct Feature Maps, Biological
Cybernetics, 43, pp. 59-69, 1982.

6. Kohonen, T., Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.

7. Kurimo, M., Indexing Audio Documents by using Latent Semantic Analysis and SOM, In:
Oja, E., and Kaski, S. (eds.), Kohonen Maps, pp. 363-374, Elsevier, Amsterdam, 1999.

8. Laaksonen, J., Koskela, M., and Oja, E., PicSOM: Self-Organizing Maps for Content-Based
Image Retrieval, In: Proceedings of IEEE International Joint Conference on Neural
Networks (IJCNN’99), IEEE, Washington, DC, 1999.







9. Narasimhalu, A., Special issue on content-based retrieval, ACM Multimedia Systems, 3(1), 1995. 

10. Nürnberger, A., Interactive Text Retrieval Supported by Growing Self-Organizing Maps, In: Ojala, T. (ed.), Proc. of the International Workshop on Information Retrieval (IR 2001), pp. 61-70, Infotech, Oulu, Finland, 2001. 

11. Nürnberger, A., and Detyniecki, M., Content Based Analysis of Email Databases Using Self-Organizing Maps, In: Proc. of the European Symposium on Intelligent Technologies (EUNITE 2001), Verlag Mainz, Aachen, 2001. 

12. Nürnberger, A., and Klose, A., Interactive Retrieval of Multimedia Objects based on Self-Organising Maps, In: Proc. of the Int. Conf. of the European Society for Fuzzy Logic and Technology (EUSFLAT 2001), pp. 377-380, De Montfort University, Leicester, UK, 2001. 

13. Nürnberger, A., and Klose, A., Improving Clustering and Visualization of Multimedia Data Using Interactive User Feedback, In: Proc. of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002), pp. 993-999, 2002. 

14. Salton, G., Allan, J., and Buckley, C., Automatic structuring and retrieval of large text files, Communications of the ACM, 37(2), pp. 97-108, 1994. 




Automatic Keyword Extraction for News Finder 



José Luis Martínez-Fernández¹, Ana García-Serrano², 
Paloma Martínez¹, and Julio Villena³ 

¹ Computer Science Department, Universidad Carlos III de Madrid 
Avda. Universidad 30, 28911 Leganés, Madrid, Spain 
{jlmferna, pmf}@inf.uc3m.es 

² Computer Science Department, Technical University of Madrid 
Campus de Montegancedo s/n, Boadilla del Monte 28660, Spain 
{agarcia}@fi.upm.es 

³ Department of Telematic Engineering, Universidad Carlos III de Madrid 
Avda. Universidad 30, 28911 Leganés, Madrid, Spain 
jvillena@it.uc3m.es 



Abstract. Newspapers are one of the most challenging domains for information 
retrieval systems: new articles appear every day, written in different languages 
and with multimedia contents, and the news repositories may be updated in a matter 
of hours, so information extraction is crucial for building the metadata of the 
news. Further approaches to “smart retrieval” have to cope with multimedia and 
multilingual features and have to achieve very good precision in order to reach 
a high degree of user satisfaction with the retrieved documents. This paper 
describes the automatic keyword extraction (AKE) process for news 
characterization, which uses several linguistic techniques to improve the current 
state of text-based information retrieval¹. The first prototype implemented, 
which focuses on the AKE process (www.omnipaper.org), is described and some 
relevant performance figures are included. Finally, some conclusions and comments 
are given regarding the role of linguistic engineering in the web era. 



1 Introduction 

In the automatic Information Retrieval (IR) process, given an expression of a 
user's information need (the user query) and a set of documents, a collection of 
the documents that are supposed to be relevant to the query is selected. An ideal 
search engine would recover all the relevant documents (recall) and only the 
relevant ones (precision). This conventional understanding of the Information 
Retrieval task first assumes that both the query and the set of documents are 
written in the same language. 
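
The two notions in parentheses above can be made concrete with a small sketch. It is an illustration only: the function name and the document identifiers are assumptions introduced here, not part of the OmniPaper system. Precision is taken as the fraction of retrieved documents that are relevant, and recall as the fraction of relevant documents that are retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents returned by the engine
    relevant:  identifiers of the documents that are truly relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)  # precision 0.5 (2 of 4), recall 2/3 (2 of 3)
```

An ideal engine in the sense of the paragraph above would reach a value of 1.0 for both measures.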

Currently English is the language most broadly used in documents on the web, but 
the number of documents in other languages is increasing daily. Even if some 
current search engines have incorporated a translator, it is used after the search, 
not during the search. That means that cross-lingual retrieval, i.e. retrieval in 
which the query and the searched documents are in different languages, is not 
possible in current IR systems because, even if the query is translated, the search 
remains monolingual. 

¹ This work is funded by the European project OmniPaper (IST-2001-32174) and the Task 
Force “User Adaptive Search Interfaces” at the EUNITE Network of Excellence. 

Another challenge for the IR task arises when some of the information included in 
the documents is multimedia (drawings, images and even links to other media), given 
that with current IR the relevance of the multimedia content to the user query is 
completely missed. Specific processes are therefore needed to cope with multimedia 
expression and search of the information. Besides, not only the quality but also the 
response time is very important for the user, who demands to be aware of events as 
soon as they happen. So further approaches to smart retrieval have to cope with 
multimedia and multilingual features and have to achieve very good precision in 
order to reach a high degree of user satisfaction with the retrieved documents. 

Newspapers are one of the most challenging domains for information retrieval 
systems: new articles appear every day written in different languages, the news 
repositories may be updated in a matter of hours, and the contents are multimedia. 
In the repositories, news items usually carry some information (metadata) describing 
aspects of their contents and appearance, such as the identification of the 
headlines, or the list and description of the images included. So one challenge to 
IR in this domain is the extraction of information from (a) different parts of the 
news, (b) multimedia contents, or (c) documents in different languages, in order to 
identify automatically the topics addressed, the category of the news and other 
features of the document. 

It is broadly assumed that the precision of current technologies for text-based IR 
has reached its upper threshold, so further enhancement of the results requires the 
integration of new practices and the aggregation of new functionalities. 




[Figure 1: diagram of the two subsystems, Uploading and Downloading] 

Fig. 1. Automatic processes schemata 



The OmniPaper project is devoted to creating an intelligent and uniform entrance 
gate to a large number of European digital newspapers by means of a multilingual 
navigation and linking layer on top of distributed resources. Our goal in the 
project is to develop a text-based information retrieval system that will work as 
an interface to several digital newspapers. It will allow users to get news related 
to their original query, as modified by the system during retrieval, and also to 
browse a hierarchical topic thesaurus and find the news that is relevant to their 
query and interesting to them. 

The system has two separate functionalities (Figure 1): management of user queries 
(the so-called download subsystem) and processing of documents (the so-called 
upload subsystem). The core of the system is the Automatic Keyword Extraction (AKE) 
process. A keyword in our work is any relevant word (or multi-word) that guides the 
information retrieval. The uploading process receives the documents and loads them 
into the database. During uploading, the incoming documents are classified with 
respect to some criteria. The criteria can be given by a taxonomy² of categories³. 

News articles are classified according to these categories and are loaded into the 
database with indexes related to the results of the classification, so that 
retrieval is improved. The document classification process is based on the keywords 
found in the documents. 

The downloading process receives a query and retrieves the documents relevant to 
it from the database. The end-user interface may allow expressing the query as given 
values of given fields (author, date, journal, but also categories), (boolean) combina- 
tions of them, or as free text in natural language, but the access to the database is only 
keyword-based. 

This paper describes the OmniPaper linguistic approach and the first prototype 
implemented, focusing on the AKE process. In the OmniPaper project AKE has different 
subtasks during both the download and upload main processes, in which statistical 
techniques are combined with linguistic-based ones. 

The structure of the paper is as follows: Section 2 gives some details about the 
linguistic and statistical techniques used in Information Retrieval, mentioning 
those used in the AKE prototype; Section 3 describes the main processes in the 
OmniPaper project; Section 4 focuses on the description of the automatic keyword 
extraction process; Section 5 concerns the technical aspects of the performance 
results of the first evaluation. Finally, some conclusions are given. 

2 Information Retrieval Framework 

Expected advances in the processing of speech and written language are both crucial 
to allow (nearly) universal access to on-line information and services. On the 
other hand, as the importance of information extraction functionalities increases in 
systems that help users in daily life, human language technology is needed to manage 
the large amount of on-line information, whether in the public domain or commercial. 
This situation is shared by IR, so three main research lines can be identified to 
improve current information retrieval technology: 

a) The first one concerns the user and his interaction with an information retrieval 
system, especially the issues of how to specify a query and how to interpret the 
answer provided by the system. 

b) The second one is related to the characterization of documents and how it affects 
the information retrieval process. 

c) The third one is devoted to proving the benefits of using linguistic resources 
during the user query or during the search itself. 

² A taxonomy is a schema of relations between concepts (e.g., a hierarchy of categories). A 
thesaurus is a concept definition normally including a taxonomy, but also synonyms, sample 
sentences, etc. 

³ In this work, we understand a category as a “word” used as a descriptor of a concept. 

Our approach consists in the development of a framework with statistical and 
linguistic techniques in order to evaluate the benefits of different processes 
defined using both kinds of techniques. 

Up to now our work has been related to the statistical approach for AKE development 
and the integration of simple linguistic techniques to complement the statistical 
framework. The morphological and semantic processing, performed in a shallow way 
(not aiming at a complete understanding of the text) on the news and user queries, 
is done using linguistic resources for different languages (English, Spanish, 
German and French in the project). 

The update of the different databases of the system, containing documents, metadata 
from the documents, topic maps, and others, required coupling the statistical 
techniques used with: 

a) available linguistic resources for every language involved in the project, such as 

• morphological taggers or analyzers in order to allow multilingual processing, 

• stemmers for reducing words that differ only by suffixes to the same root or 
just to a sequence of a fixed number of letters, 

• taggers for identifying the Part Of Speech (POS) tags of words in documents and 
queries, 

• syntactic analyzers or segmenters for the identification of phrases or multi-words 
such as “the rights of the child”, 

• semantic lexicons and heuristics for proper names and entity recognition. 

b) lexical ontologies that are language and domain dependent for the incorporation 
of semantic and pragmatic processes. 

Linguistic resources that were not available have also been developed in the project 
and included in the complete processes to improve the search, namely an entity 
recognizer (personal, geographical and institutional names or multi-words) and a 
phrase identifier. Both are currently in the evaluation step to calculate the 
precision and the degree of enhancement of the search. 

In the following, different works related to the techniques used in the implemented 
OmniPaper AKE are described. 



2.1 Linguistic Techniques and Available Resources 

A Natural Language Processing (NLP) system has as its generic goal to translate a 
source representation into a final representation that will be integrated in another 
system with special functionality [1]. The development of these systems must take 
into account the following aspects: what features must the system hold? what type 
of texts does the system work on? what is the required functionality (which 
processes lead to this functionality? how are these processes to be linked? how is 
each of them carried out?) what knowledge does the system need to perform its 
functionality (how are the knowledge pieces related? how is the knowledge 
represented?) All these questions are answered through the life cycle phases of an 
NLP system (analysis, design and implementation), and their study helps to design 
the knowledge model we present. 

NLP research has mainly grown from symbolic or statistical systems approaches in 
computer science or linguistics departments, motivated by a desire to understand 
cognitive processes and, therefore, the underlying theories from linguistics and 
psychology. Practical applications and broad coverage received little attention 
until the last ten years, when a growing interest in the use of engineering 
techniques opened up new challenges in the field. 

In real applications, approaches based on complex grammars become difficult to 
maintain and reuse, so current applications employ simple grammars (different kinds 
of finite-state grammars for efficient processing) and some approaches even do away 
with grammars and use statistical methods to find basic linguistic patterns. 



<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?> 
<documento> 
  <articulo id="DTG200209300145"> 
    <head> 
      <DREDATE>30/09/2002</DREDATE> 
      <Newspaper>THE DAILY TELEGRAPH</Newspaper> 
      <Title>Leicester show grit </Title> 
      <Author> By Ben Findon</Author> 
    </head> 
    <body> 
      <DRECONTENT><p> LEICESTER CITY 1 WOLVERHAMPTON 0 FROM Leicester in the east to 
      Wolverhampton in the west, the Midlands seems a division within a division. All those frantic derby 
      battles complicating the wider promotion picture. 
      </p><p> It was trench warfare at the Walkers Stadium, where resolve flowed from every pore in 
      industrial-strength quantities. 
      </p><p>Yellow cards - and one red - peppered the afternoon and statistics showed Leicester winning 
      without managing a shot on target. 

Fig. 2. Partial view of a news item from My News On Line (http://www.mynewsonline.com/) 

The pre-processing analysis builds a representation of the input text. It consists 
of the analysis of chains of symbols by a parser that maps each word to some 
canonical representation using different techniques such as morphological analysis, 
morphological disambiguation, shallow parsing and others. This canonical 
representation is sometimes a linguistic one, such as the lexical root, the lemma 
(body for bodies), a stem obtained by suppressing letters (bodi for bodies), or any 
other domain-dependent description (less usual). 

Stemming (reduction to a string) is usually associated with simple algorithms 
which do not do any morphological analysis. For example, the Porter algorithm 
(1980) [2] essentially: 

• removes plurals, -ed, -ing, 

• changes a terminal 'y' to 'i' when there is another vowel in the stem, 

• maps double suffixes to single ones (e.g., -isation), 

• deals with -ic, -ful, -ness, and takes off -ant, -ence, 

• removes a final -e if the word is long enough (word > 2). 
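
As an illustration only, the kind of suffix stripping listed above could be sketched as follows. This is a deliberately simplified stand-in, not the full Porter algorithm [2]: it omits the measure-based conditions and most of the suffix tables, and the function names are assumptions made for this example.

```python
VOWELS = set("aeiou")


def has_vowel(s):
    """True if the string contains at least one vowel."""
    return any(ch in VOWELS for ch in s)


def simple_stem(word):
    """Very rough suffix stripper inspired by the first Porter steps."""
    w = word.lower()
    if len(w) <= 2:  # very short words are left untouched
        return w
    # Plurals: "-sses" -> "-ss", "-ies" -> "-i", otherwise drop a final "s".
    if w.endswith("sses") or w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Drop "-ing" or "-ed" only if a vowel remains in the stem.
    for suffix in ("ing", "ed"):
        if w.endswith(suffix) and has_vowel(w[:-len(suffix)]):
            w = w[:-len(suffix)]
            break
    # Terminal "y" becomes "i" when there is another vowel in the stem.
    if w.endswith("y") and has_vowel(w[:-1]):
        w = w[:-1] + "i"
    # Drop a final "e" if more than two letters remain.
    if w.endswith("e") and len(w) - 1 > 2:
        w = w[:-1]
    return w


for w in ["bodies", "body", "leaving", "showed", "received", "minutes"]:
    print(w, "->", simple_stem(w))
```

Both bodies and body are reduced to the same string bodi, which is the behaviour that makes stemming useful for indexing even though the result is not a linguistic lemma.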



The selection of available stemmers for languages other than English is not a 
simple task. The Porter algorithm successfully substitutes for a morphological 
analysis of English (for indexing purposes), but it is not easy to find simple and 
successful algorithms for more inflectional languages, even those with a morphology 
only slightly more complex than English. Porter and other stemming algorithms that 
do not extract the base form of words via morphological analysis may fail to 
identify inflectional variants of terms in such languages, and may fail to relate 
synonymous words, given that they do not distinguish different meanings of the same 
word (lexical ambiguity is ignored). 

A morphological analysis of a word form produces a set of possible base forms 
with associated inflectional information [3]. For each occurrence of a word form in 
context, a POS (Part-of-Speech) tagger determines which of these base forms is 
most likely in the context. A typical set of tags could be: CC (coordinating 
conjunction), CD (cardinal number), DT (determiner), NN (noun, singular or mass), 
etc. Ambiguities due to homonyms, multiple functions of affixes or uncertainty about 
suffix and word boundaries are resolved by rule-based and probabilistic approaches. 
The accuracy of hybrid taggers for English has remained constant from 94, whereas 
the rule-based disambiguator of Voutilainen has obtained the best results (99.7%). 

The main morphological problem is the disambiguation of morphological alternatives, 
when the same morpheme may be realised in different ways depending on the context, 
dealing with the valid arrangements among stems, affixes and parts of compounds. 
Another difficult task concerns the management of the ambiguity related to the 
appropriate use of words in context by manipulation of syntactic or semantic 
properties of words. 

For English, which is by far the most explored language, the morphology is simpler 
than for other languages. English Part-of-Speech tagging may help, for example, in 
identifying meaningful phrases, even though most languages use traditional 
approaches which assume that word use extensibility can be modelled by exhaustively 
describing the meaning of a word through a closed enumeration of its senses. There 
is strong evidence that the mechanisms that govern lexical knowledge are related to 
word sense, and several attempts have been made to develop a dynamic approach to 
polysemy (when there are multiple senses for a word) and to create new aspects of 
word use [4]. 

One of the first available lexicons organized into a semantic map of words was 
WordNet®, a very large thesaurus created at Princeton University [5,6]. English 
nouns, verbs, adjectives and adverbs are organized into synonym sets, each 
representing one underlying lexical concept. Different semantic relations⁴, not 
definitions of meaning, link the synonym sets. 

EuroWordNet (EWN) [6], a multilingual lexical database that includes several 
semantic relationships among words in English, Spanish, German and Italian, is also 
available (through a licence agreement). EWN is structured as a top concept 
ontology that reflects different explicit opposition relationships (e.g., animate, 
inanimate) and it can be seen as a representation of several vocabulary semantic 
fields. Moreover, it contains a hierarchy of domain tags that relate concepts in 
different subjects, for instance, sports, winter sports, water sports, etc. 

EWN is a European initiative that has been developed by computer scientists to 
include linguistic expertise about words. It has been used in very different 
applications. It is important to say that the direct use of synonyms to expand user 
queries in IR systems has always failed in terms of precision and recall [7,8]. A 
new technique for using EWN in the semantic processes needed for Information 
Extraction (IE) from multimodal documents should be investigated. In the OmniPaper 
project we are trying to find an aggregation method for determining the category of 
news. 

A standard assumption in computationally oriented semantics is that knowledge of 
the meaning of a sentence is a function of the meaning of its constituents (Frege 
1892). The modes of combination are largely determined by the syntactic structure 
and valid inferences from the truth conditions of a sentence. But in real-life 
applications the statement of interpretation becomes extremely difficult, especially 
when sentences are semantically but not syntactically ambiguous. For this reason, 
in most applications the interpretation of a sentence is given directly as an 
expression of some artificial or logical language, from which an interpretation is 
inferred according to the context. This intermediate level of representation is 
also needed (for explicit reference into the representation) in order to capture 
the meanings of pronouns or other referentially dependent items, elliptical 
sentences or sentences ascribing mental states (beliefs, hopes and intentions). 

Although some natural language processing tasks can be carried out using statistical 
or pattern matching techniques that do not involve deep semantics, performance 
improves if it is involved. But for most current applications the predictive and 
evidential power of a general purpose grammar and a general control mechanism is 
insufficient for reasonable performance. The alternative is to devise grammars that 
specify directly how relationships relevant to the task may be expressed in natural 
language. For instance, a grammar in which terminals stand for concepts, tasks or 
relationships and rules specify possible expressions of them could be used. Current 
approaches [9,10] are based on partial or shallow parsing⁵. These partial parses are 
further used in pragmatic issues in order to find an adequate context-dependent 
interpretation. The problems in this approach are overcome by extending the 
resources with explicit models of the linguistic phenomena or by designing the 
linguistic analysis more robustly, but only until performance becomes insufficient 
again. 

⁴ Synonymy - the related word is similar to the entry, as (PIPE, TUBE). 
Antonymy - the terms are opposite in meaning, as (WET, DRY). 
Hyponymy - one term is under or subordinate to the other, as (MAPLE, TREE). 
Meronymy - one term is a part of the other, as (TWIG, TREE). 
Troponymy - one term describes a manner of the other, as (WHISPER, SPEAK). 
Entailment - one term implies the other, as (DIVORCE, MARRY). 

For the current state of the AKE prototype for queries and news, we are using 
stemmers for English and Spanish and we have adapted EWN for both languages [11]. 
Currently we are working on (a) the evaluation of the Proper Noun identification 
module, in order to modify the keyword measures taken into account during the 
search, (b) the tuning of the stopword list for different functionalities and (c) 
entity recognition using EWN for news categorization. 



2.2 Statistical Techniques for Information Retrieval 

Two different approaches can be considered when trying to configure the relevant 
terms of a document [12]. In the simplest approach, one term is one word, and words 
are often preprocessed (they are stemmed, stopwords⁶ are dropped, etc.). In a more 
complex approach, not only single words are considered as indexing terms: a phrase 
can also be a term. A phrase is “an indexing term that corresponds to the presence 
of two or more single word indexing terms” [13], although we consider two or more 
single words whether or not they are indexing terms. 

The notion of phrase can be considered in a syntactical or a statistical way [14]. 
In the syntactical notion, the techniques used for detecting the presence of phrases 
in the text are NLP-oriented. In the statistical notion, on the other hand, methods 
based on n-grams⁷ [15], or on the co-occurrence of two words in the text, are 
applied. This statistical notion has the advantage of language independence, as it 
does not require preprocessing of words or elimination of stopwords. 
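
A minimal sketch of the two statistical devices just mentioned may help: a window of N characters slid through the text, and counts of adjacent word pairs as a crude co-occurrence measure. The function names and the sample sentence are assumptions made for illustration; the methods of [15] may differ in detail.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams obtained by sliding a window of size n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def adjacent_pairs(tokens):
    """Count how often each pair of consecutive words occurs."""
    return Counter(zip(tokens, tokens[1:]))


print(char_ngrams("news", 3))  # ['new', 'ews']

tokens = "the rights of the child and the rights of women".split()
pairs = adjacent_pairs(tokens)
print(pairs[("the", "rights")])  # 2
print(pairs[("the", "child")])   # 1
```

Neither function needs a stopword list or a stemmer, which is the language independence referred to above; pairs that recur often are candidates for statistical phrases.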

Whatever the types of terms are, numeric weights are commonly assigned to document 
and query terms. The “weight” is usually a measure of how effective the given term 
is likely to be in distinguishing the given document from other documents in the 
collection (between 0 and 1 if it is normalized). Weights can also be assigned to 
the terms in a query. 

⁵ The term shallow syntax or shallow parsing refers to an analysis less complete than the output 
of a conventional parser, used to annotate texts with superficial syntactic information. It may 
identify some phrasal constituents, such as noun phrases, with no indication of either their 
internal structure or their function in the sentence. It can also identify the functional role of 
some words, such as the main verb and its direct arguments. 

Shallow parsing normally works on top of morphological analysis and disambiguation to 
infer as much syntactic structure as possible from the morphological information and word 
order configuration. Work remains to be done on the integration of shallow with deeper 
analysis to solve co-ordination and ellipsis phenomena, as well as on interfacing morphological 
descriptions with lexicon, syntax and semantics in a maximally informative way. 

⁶ Stopwords are words not relevant for information retrieval using statistical technologies (e.g. 
articles, prepositions, conjunctions, etc.). 

⁷ N-gram processing is a technique based on a window formed by a fixed number of characters 
(N) that is slid through the text. 

Regarding the definition of what a keyword is, different studies in the literature: 

• [16] recommend that terms found repeatedly in a document are appropriate for 
indexing, so the weights can be ranked by means of the term frequency relative to 
the frequency in the overall corpus; 

• [17] argue that words found in the document under study, but rarely in other 
documents, are important, so the use of the inverse document frequency (idf) as a 
term score is appropriate; 

• [18] propose the combination of the idf with the in-document frequency by taking 
their product as a measure of term importance. 

Statistical frameworks break documents and queries into terms; these terms represent 
the population that is counted and measured statistically. In information retrieval 
tasks, what the user really wants is to retrieve documents that are about certain 
concepts, and these concepts are described by a set of keywords. Of course, a 
document may deal with multiple subjects. Generally speaking, the set of terms that 
describe a document is composed of all the words (or phrases) of the document except 
stopwords (this is the idea of the so-called full-text search); optionally, these 
words could be stemmed⁸. 

Moreover, not every word is used for indexing a document: usually, a filtering 
method is applied in order to select the most adequate words, which constitute the 
keywords of a document. This is the approach taken in the vector space model, 
described below. 

In the vector space model documents are represented by a set of keywords extracted 
from the documents themselves. The union of all sets of terms is the set of terms 
that represents the entire collection and defines a “space” such that each distinct 
term represents one dimension in that space. Since each document is represented as a 
set of terms, this space is the “document space”. 

In the document space, each document is defined by the weights of the terms that 
represent it (user queries are represented in the same way), that is, the vector 
$d_j = (wd_{j1}, wd_{j2}, \ldots, wd_{jm})$, where $m$ is the cardinality of the set 
of terms and $wd_{ji}$ represents the weight of term $i$ in document $j$. 
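
The document space can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names, the sample terms and the weights are invented for the example, and a term that does not occur in a document is simply given weight 0.

```python
def build_term_space(weighted_docs):
    """Union of all terms in the collection, in a fixed (sorted) order.

    Each position in the returned list is one dimension of the space.
    """
    return sorted({term for doc in weighted_docs for term in doc})


def to_vector(weights, terms):
    """Map a {term: weight} dictionary to a vector over the term space."""
    return [weights.get(term, 0.0) for term in terms]


# Hypothetical weighted documents (weights chosen by hand).
docs = [
    {"match": 1.0, "win": 0.5},
    {"vote": 1.0, "win": 0.8},
]
terms = build_term_space(docs)                 # ['match', 'vote', 'win']
vectors = [to_vector(d, terms) for d in docs]  # one vector per document
query = to_vector({"win": 1.0}, terms)         # queries use the same space

print(terms)
print(vectors)
print(query)
```

Because documents and queries share the same dimensions, they can be compared component by component, which is what the weighting scheme discussed next relies on.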

The most successful and widely used scheme for automatic generation of weights 
is the “term frequency * inverse document frequency” weighting scheme: 

• The expression used for the term frequency could be the number of times a word 
appears in the document, normalized to allow for variation in document size. The 
term frequency (TF) for term $i$ in document $j$ is $tf_{ij} = n_{ij} / maxtf_j$, 
where $n_{ij}$ is the number of occurrences of term $i$ in document $j$, and 
$maxtf_j$ is the maximum term frequency for any word in document $j$. 

• The inverse document frequency (IDF) for word $i$ is $idf_i = \log_2(N / n_i) + 1$, 
where $N$ is the total number of documents in the collection and $n_i$ is the number 
of documents in the collection where the word $i$ appears. 

⁸ A keyword can be used for IR in different forms. For example bodies can be transformed into 
(a) lexical stem bodi, (b) stem bod (suppressing letters), (c) lemma body, (d) phrase the body. 

Computing the weight of a given term in a given document as tf * idf implies that 
the best descriptors of a given document will be terms that occur a good deal in the 
given document and very little in other documents. Similarly, a term that occurs a 
moderate number of times in a moderate proportion of the documents in the collec- 
tion will also be a good descriptor. This is the approach used in the OmniPaper pro- 
ject. 
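
The weighting scheme can be sketched as follows, using tf = n / maxtf and idf = log2(N / n_i) + 1 as defined above. This is a minimal illustration, not the OmniPaper implementation: the function names and the three toy documents are assumptions, and the documents are taken to be already tokenized.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """tf = n / maxtf for every term of one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {term: n / max_tf for term, n in counts.items()}


def inverse_document_frequencies(docs):
    """idf = log2(N / n_i) + 1 for every term of the collection."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log2(n_docs / n) + 1 for term, n in doc_freq.items()}


def tfidf(docs):
    """One {term: tf * idf} dictionary per document."""
    idf = inverse_document_frequencies(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequencies(doc).items()}
        for doc in docs
    ]


# Hypothetical, already tokenized documents.
docs = [
    ["leicester", "win", "match", "match"],
    ["election", "win", "vote"],
    ["match", "report", "vote"],
]
weights = tfidf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

In the first document, match occurs twice but also appears in another document, while leicester occurs once and nowhere else; the product tf * idf balances the two effects described in the paragraph above.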



Without Stemming                                 With Stemming 

<key-list>                                       <key-list> 
<keyword weight="5.24"> leaving </keyword>       <keyword weight="3.80"> leav </keyword> 
<keyword weight="5.00"> showed </keyword>        <keyword weight="6.00"> show </keyword> 
<keyword weight="5.00"> body </keyword>          <keyword weight="4.58"> bodi </keyword> 
<keyword weight="4.90"> received </keyword>      <keyword weight="4.00"> receiv </keyword> 
<keyword weight="4.80"> minute </keyword>        <keyword weight="7.61"> minut </keyword> 
<keyword weight="4.45"> minutes </keyword> 
</key-list>                                      </key-list> 

Fig. 3. Keyword weights with and without stemming 



Could this approach be considered a valid one? It has been established empirically 
that not every word that appears in a document is helpful for characterizing it: 
only words with a good discrimination capability are useful for building a document 
characterization. Words that are present in all documents do not allow a subset of 
documents in the collection to be recognized. Conversely, if a word appears in only 
one document, it discriminates only one document in the entire collection, so it is 
not useful for characterizing a document subset in the collection. 

As a matter of fact, the final goal pursued by the vector components is to 
categorize documents in the collection according to the query stated by the user. 
Vector components should thus be selected with this purpose in mind. A frequency 
threshold is used for selecting the words that will become vector components. In a 
generic document collection, if a word appears in more than 10 per cent and in less 
than 90 per cent of the documents, then that word should be used as a document 
keyword. Our tests have tried different thresholds (Section 5). 
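
The threshold rule can be sketched as a simple filter on document frequency. The function name, the default bounds as parameters and the toy collection are assumptions made for illustration; only the 10 and 90 per cent figures come from the text.

```python
from collections import Counter


def select_keywords(docs, low=0.10, high=0.90):
    """Keep terms appearing in more than `low` and less than `high` of the documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term for term, n in doc_freq.items() if low < n / n_docs < high}


# Hypothetical collection of 10 tokenized documents:
# "the" occurs in all of them, "goal" in 4, "zanzibar" in exactly 1.
docs = [["the", "goal"] for _ in range(4)] + [["the"] for _ in range(6)]
docs[0].append("zanzibar")

print(select_keywords(docs))  # {'goal'}
```

Here the word present in every document and the word present in a single document (exactly 10 per cent, so not more than the lower bound) are both discarded, which matches the two cases discussed above.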

This weighting scheme assumes a static collection of documents against which each
user query is run. What happens, then, if a new document is added to the set, that
is, if the collection is not fixed? In this case, every idf measure should be
recalculated, and index term selection according to the chosen frequency threshold
should be performed again.

Usually, it is possible to provide a training set of typical documents for which idf
values can be calculated. This implies the assumption that all documents subsequently
received by the system will have the same “statistical properties” as the training
set [19]; the alternative is to update the training set regularly. The OmniPaper
approach is an incremental one. The idf measure is updated with each new document;
the components of subsequent documents are therefore computed with current values,
but previously computed components become progressively obsolete. Under the above
assumption, to be safe, all vector components are recomputed at selected points in
time. In this way, the document collection is kept up to date incrementally.
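
The incremental scheme can be sketched as follows. The idf formula used here, log(N / df), is an assumption made for illustration, as are the class and method names; the variant actually used by the prototype may differ.

```python
import math
from collections import Counter

class IncrementalIdf:
    """Keeps document frequencies up to date as documents arrive."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()

    def add_document(self, tokens):
        self.n_docs += 1
        self.df.update(set(tokens))  # one count per document, not per occurrence

    def idf(self, word):
        # Assumed formula: log(N / df). The word must already have been indexed.
        return math.log(self.n_docs / self.df[word])

    def weigh(self, tokens):
        tf = Counter(tokens)
        return {w: tf[w] * self.idf(w) for w in tf}

index = IncrementalIdf()
index.add_document(["bank", "rates"])
old_vector = index.weigh(["bank", "rates"])   # weights computed when N = 1
index.add_document(["bank", "strike"])
new_vector = index.weigh(["bank", "rates"])   # same document, recomputed when N = 2
```

The weight of "rates" changes between the two calls although the document itself did not, which is why previously stored vectors drift out of date and have to be recomputed periodically.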

As a final remark, it seems to be agreed that, for a word to be considered a good
keyword, its occurrence frequency in the document under study should not be
predictable by observing the remainder of the collection. Among the most recent
algorithms for extracting keywords, KEA [20] treats key-phrase extraction as a
machine learning problem, and the work of [8] proposes the use of phrases for
browsing multilingual documents.



3 Overall Description of the OmniPaper System 



The OmniPaper system has two separate functionalities: management of user queries 
(download subsystem) and processing of documents (upload subsystem). Figure 4 
shows the way in which the system uploads a news document whenever a new one is 
received. 




Fig. 4. OmniPaper Uploading Architecture

The upload process works as follows: when new articles arrive at a news provider,
the provider sends their IDs to the upload manager through SOAP; the manager then
sends a SOAP request to the news provider asking for a given document, identified by
its ID, and receives it. Metadata included in the document header is sent to the
Metadata repository, where information such as author name, publishing date or news
subject is stored.

Then, a tokenization process is performed on the document, including stopword
filtering, stemming, and proper name detection. After the document has been
tokenized, the keyword extraction process takes place. Using statistical techniques
combined with linguistic processing, the extracted keywords are weighted according to
their relevance. Each document is then stored as a vector following the vector space
model. Document vectors are used for topic learning, a process that classifies the
document as part of the topic thesaurus. Finally, all the information about the
document is used for obtaining web links, both to external personal or corporate
websites and to other news items in the collection.
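
The first steps of this pipeline (tokenization, stopword filtering, optional stemming, term counting) can be sketched as follows. The stopword list is illustrative only, and `toy_stem` is a crude stand-in for a real stemmer such as Porter's; neither is taken from the OmniPaper implementation.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to"}  # illustrative list only

def toy_stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer, not the Porter algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text, stem=True):
    """Tokenize, drop stopwords, optionally stem, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return Counter(tokens)

tf = term_frequencies("The banks raised rates and the bank closed")
tf_raw = term_frequencies("The banks raised rates and the bank closed", stem=False)
```

With stemming, "banks" and "bank" are counted under one entry; without it they remain two separate terms. Proper name detection is not covered by this sketch.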








Fig. 5. OmniPaper Downloading Architecture 

As a result of the uploading process, the system database is kept up to date with
meta-information and indexing information on the news items known to the system.
Based on this information, the downloading subsystem processes a user query and
searches the database for possible matches. The downloading subsystem is depicted in
Figure 5. A natural language user query is processed with NL and statistical
techniques, in much the same way as in the uploading process. As a result, a vector
is obtained that represents the query. This vector is used for an indexed search in
the vector database, using matching measures such as cosine similarity [21].

The result of the search is a set of news IDs relative to the news provider. These
IDs can then be used to retrieve the documents from the provider. Multilingual
results can be obtained by translating the keyword components of the search vector,
or from the Topic Maps relating concepts and keywords in different languages. An
optional translation process can then be performed on the document text using
external translators (note, however, that the user query is not translated, but
processed in its original language).
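
Query-document matching with cosine similarity can be sketched as follows, assuming sparse vectors stored as word-to-weight dictionaries. The function names, the document IDs and the linear scan over all vectors are illustrative assumptions; an indexed search, as mentioned above, would avoid scanning every document.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vecs, top_k=10):
    """Return the IDs of the best-matching documents, most similar first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

doc_vecs = {
    "news-1": {"bank": 1.0, "rate": 2.0},
    "news-2": {"team": 3.0},
    "news-3": {"bank": 2.0},
}
results = search({"bank": 1.0}, doc_vecs)
```

The result is a list of news IDs, which matches the description above: the IDs are then used to fetch the documents from the provider.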

Many of the downloading subsystem features outlined above are currently under 
further development and refinement. In the rest of this paper we will mainly focus on 
the uploading process (although many of the techniques used in this process will also 
be applicable to user queries, as mentioned before). 

4 AKE First Prototype 

This section describes system features from a development point of view: development
platform, development tools, data structures required, and design issues.

Within OmniPaper, a standard format for the news contents has been defined. The
content of the system is actually the metadata of the articles; it is written in XML,
which eases its interaction with standards such as RDF and/or XTM. The connection
with the content providers is via SOAP. Requests are used to retrieve documents for
processing on uploading and to show the contents to the user on downloading, once
the system has determined, based on the abovementioned metadata, that a news article
fulfils a user’s request.

The standard format for the metadata has been defined following widely accepted
standards, in particular the Dublin Core Metadata Element Set (DCMES), and the
News Industry Text Format (NITF), but also NewsML [23]. The format describes
twenty-three basic elements, grouped under these categories: Identification,
Ownership, Location, Relevance, Classification, and LinkInfo.

• The Identification category includes an identifier for the news item the metadata
refers to (so that it can be requested from the provider), and sub-elements like
Creator, Title, Subtitle, Publisher, Language, etc.

• The Relevance category contains information that can be used to compare with the 
user profile to decide on how the news fits the intention of particular users based 
on their behaviour. 

• The LinkInfo category contains links and references to related documents that can
be used to improve the results on particular information requests.

• The Classification category contains sub-elements that allow classifying the
document on certain criteria; in particular, and most relevant to the present paper,
the Key_list and the Subject.

The Key_list has sub-elements which are the keywords identified in the keyword
extraction process described in the rest of this paper. Based on these keywords, the
document can be classified obtaining a Subject, which is also a sub-element of
Classification. This subject is usually the main one, and can be used for relating
the document to other documents based on the navigation of a Topic Map of categories
and related subjects. However, the description of this map is outside the scope of
this paper.
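
A metadata record of this kind might be assembled as sketched below. Only the names Identification, Title, Language, Classification, Key_list and Subject come from the description above; the root element `NewsMetadata`, the elements `Identifier` and `Key`, and the `weight` attribute are hypothetical, since the actual OmniPaper schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

def build_metadata(news_id, title, language, keywords, subject):
    """Build a small metadata record; several element names are assumptions."""
    root = ET.Element("NewsMetadata")                   # hypothetical root name
    ident = ET.SubElement(root, "Identification")
    ET.SubElement(ident, "Identifier").text = news_id   # hypothetical element name
    ET.SubElement(ident, "Title").text = title
    ET.SubElement(ident, "Language").text = language
    classif = ET.SubElement(root, "Classification")
    key_list = ET.SubElement(classif, "Key_list")
    for word, weight in keywords:
        key = ET.SubElement(key_list, "Key")            # hypothetical element name
        key.set("weight", f"{weight:.2f}")              # hypothetical attribute
        key.text = word
    ET.SubElement(classif, "Subject").text = subject
    return root

record = build_metadata("id-001", "Bank raises rates", "en",
                        [("bank", 0.82), ("rate", 0.64)], "Economy")
xml_text = ET.tostring(record, encoding="unicode")
```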

The AKE Subsystem prototype has been implemented in a development environment
with the following characteristics:

• Linux RedHat 7.3 and Debian operating systems, 

• GNU C/C++ and Ciao Prolog programming languages and 

• MySQL 3.23 DBMS. 

The MySQL database management system has been chosen due to its efficiency 
and performance characteristics when working with the indicated operating systems. 
The simple database structure used by the system is depicted in Figure 6. 



Fig. 6. Database structure used by the AKE System

There are five basic relations: 

• Document. Stores metadata information extracted by the preprocessing of texts.
Documents are assigned a unique identifier stored in this relation, and used for
referencing a document across the keyword extraction system.

• Word. Contains all distinct words recognized by the linguistic process run over all
documents in the collection. It stores Document Frequency (DF) information, i.e.,
the number of distinct documents that contain a given word. This DF value is needed
to obtain the Inverse Document Frequency (IDF) of a word.

• Contains. This relation is used for registering words that appear in a document 
together with their Term Frequency inside the document. 

• Vectorize. Used for storing vectors built for documents in the collection. These 
vectors are formed by words and their associated weights. 

• Configuration. The AKE Subsystem can be configured according to the parameters
stored in this structure. Currently, the parameters stored in this relation are the
stopword file locations for each target language (English, French, German, Dutch and
Spanish), the maximum and minimum frequency thresholds, and a flag indicating
whether stemming is to be applied.

Disk storage requirements have also been analyzed. The collection managed by the
content providers involved in the OmniPaper project has about 5*10^6 documents. With
the database structure described above, and taking field sizes into account, about
35 GBytes are needed to store all the information managed by the AKE Subsystem for
all documents in the collection. This is not a demanding requirement for current
storage technologies, even when future collection growth is considered.
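
The five relations can be sketched as a schema. The sketch below uses SQLite through Python's standard library purely for convenience, whereas the prototype uses MySQL 3.23; all column names and types are assumptions, since only the relation names and their purpose are given above. The last line derives the per-document storage budget implied by the quoted figures.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Document (doc_id INTEGER PRIMARY KEY, metadata TEXT);
CREATE TABLE Word (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE, df INTEGER NOT NULL);
CREATE TABLE Contains (doc_id INTEGER, word_id INTEGER, tf INTEGER NOT NULL,
                       PRIMARY KEY (doc_id, word_id));
CREATE TABLE Vectorize (doc_id INTEGER, word_id INTEGER, weight REAL NOT NULL,
                        PRIMARY KEY (doc_id, word_id));
CREATE TABLE Configuration (name TEXT PRIMARY KEY, value TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One document containing the word "bank" three times.
conn.execute("INSERT INTO Document VALUES (1, '<metadata/>')")
conn.execute("INSERT INTO Word VALUES (1, 'bank', 1)")
conn.execute("INSERT INTO Contains VALUES (1, 1, 3)")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Rough per-document budget implied by the text: 35 GBytes over 5*10^6 documents,
# taking 1 GByte as 10**9 bytes.
bytes_per_doc = 35e9 / 5e6
```

Under these assumptions the estimate corresponds to roughly 7,000 bytes per document for metadata, term frequencies and vector components together.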

The AKE Subsystem has been designed using an object-oriented methodology and
implemented using standard GNU C/C++ classes. There are five key classes which
implement basic system functionality and a set of secondary classes that support
database access and data structure management.

The class diagram designed for the system has, basically, a class definition for
each system component.

1. The CTokenizer class is centred on text processing, splitting text into words
using a tokenizer based on an implementation developed by Daedalus S.A. [23]; it has
been thoroughly tested, since it is a core component of the company’s products.

2. The CDocument class supplies the necessary functionality for term frequency and
vector management.

3. CDictionary implements the memory structure that stores in main memory all the
distinct words that appear in indexed documents, based on the CTree object.

4. The CTree class provides an implementation of a binary tree whose nodes are
two-field structures containing word-identifier or identifier-number pairs. This kind
of node makes it possible to deal with dictionary entries or vector components. It
uses the standard GNU implementation of Knuth’s Algorithm T, which speeds up memory
searches.

5. CIndexer uses all these classes for obtaining document keywords and their
associated vectors. Stopwords are removed through the CStopword class, which loads
different stopword files based on the target document language. Stemming can
optionally be applied using the CPorter class, which includes an implementation of
the Porter algorithm taken from [2].
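
The role of the tree-based dictionary can be illustrated with a small search-and-insert routine on an unbalanced binary search tree, which is the kind of operation Knuth's Algorithm T describes. The sketch is written in Python rather than C/C++ for brevity, and the class and method names are hypothetical; it is not the CTree code.

```python
class Node:
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class TreeDictionary:
    """Unbalanced binary search tree mapping each distinct word to an integer id."""

    def __init__(self):
        self.root = None
        self.size = 0

    def find_or_insert(self, word):
        """Return the id of word, inserting it with the next free id if absent."""
        if self.root is None:
            self.root = Node(word, 0)
            self.size = 1
            return 0
        node = self.root
        while True:
            if word == node.key:
                return node.value
            branch = "left" if word < node.key else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(word, self.size))
                self.size += 1
                return self.size - 1
            node = child

d = TreeDictionary()
ids = [d.find_or_insert(w) for w in ["bank", "rate", "bank", "team"]]
```

A repeated word gets the identifier it was first given, so the same structure can serve both for building the dictionary and for looking words up while vectors are constructed.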

The object oriented system design described in this section is aimed at obtaining 
scalable and maintainable software. This will allow new linguistic and statistical 
techniques to be easily applied to the indexing process. Comparison between different 
keyword extraction techniques in a multilingual environment is one of the major 
objectives in the OmniPaper project. 



5 Evaluation of the Automatic Keyword Extraction Process 

The AKE Subsystem has been tested using a news dataset supplied by My News On Line
(http://www.mynewsonline.com/), which is also involved in the OmniPaper project. This
collection consists of 1881 English news articles published during September 2002.
Each article is described in XML, including some metadata, and the average document
length (stopwords removed) is 250 words. Nearly twenty queries have been defined for
the 1881 documents. Each query is accompanied by the collection documents that match
it. The documents that match the queries were identified manually (the next
evaluation step is going to be carried out in CLEF 2004; see below for more details).

There are two basic configuration parameters that must be taken into account for 
test and evaluation purposes. These parameters are: 

• Stemming. If stemming is applied, words will be represented by a canonical form,
the word stem. This leads to a grouping of words under the same representative stem,
reducing dictionary size.

• Vector dimension and keywords quality.

As already mentioned, keywords are selected according to the word DF in the whole
collection. Tests were run to evaluate the system with the following Frequency
Thresholds (FT):

• 5% - 90% of total documents in collection. This threshold has been established
considering empirical results. In this experimental work, the a priori minimum FT
of 10% turned out to be ineffective because some documents were not assigned any
keyword.

• 0% - 100% of total documents in collection. In this test no keyword selection is
applied according to DF. It will be interesting to evaluate differences in processing
times in the future.

• 5% - 100% of total documents in collection.

• 0% - 90% of total documents in collection.

The results of the performance evaluation obtained for the different executions of
the system are summarised below. Tests were run on a computer with an Intel
Pentium III 800 MHz processor and 256 MB of RAM.



Table 1. Basic experimental results for Keyword Extraction Subsystem

Stemming applied

| Frequency threshold | Aver. keyw. per doc. | Total keyw. in collection | Total dict. entries | Indexing time (s) | Vector const. time (s) |
|---------------------|----------------------|---------------------------|---------------------|-------------------|------------------------|
| 5% - 90%            | 72.4003              | 726                       | 29,684              | 179               | 111                    |
| 0% - 100%           | 167.7363             | 29,684                    | 29,684              | 172               | 161                    |
| 5% - 100%           | 73.3615              | 727                       | 29,684              | 177               | 112                    |
| 0% - 90%            | 166.7751             | 29,683                    | 29,684              | 199               | 170                    |

Stemming NOT applied

| Frequency threshold | Aver. keyw. per doc. | Total keyw. in collection | Total dict. entries | Indexing time (s) | Vector const. time (s) |
|---------------------|----------------------|---------------------------|---------------------|-------------------|------------------------|
| 5% - 90%            | 50.0138              | 537                       | 42,265              | 199               | 104                    |
| 0% - 100%           | 178.1637             | 42,265                    | 42,265              | 208               | 172                    |
| 5% - 100%           | 50.9761              | 538                       | 42,265              | 199               | 105                    |
| 0% - 90%            | 177.2015             | 42,264                    | 42,265              | 200               | 173                    |

Results depicted in Table 1 show some interesting facts:

1. When stemming is applied, the total number of dictionary entries is considerably
reduced, as is processing time. Considering execution times and storage
requirements, stemming leads to a more efficient system. It remains to be verified
whether the values of traditional quality measures for Information Retrieval
systems, such as recall and precision, are better when stemming is applied.

2. The stemming process leads to a greater number of keywords per document but a
smaller number of total keywords in the document collection. This effect is due to
variations in the word frequency distribution. With stemming, words are grouped
under the same stem, so that more stems surpass the minimum FT than simple words do
in the no-stemming approach (there are fewer stems in total than simple words).

3. The most restrictive FT is the minimum frequency threshold. If the figures for
the 5% - 90% and 5% - 100% thresholds are compared, the differences in average
keywords per document and total keywords in collection are very small. In contrast,
the comparison between the 5% - 90% and 0% - 90% thresholds shows much greater
differences. These results could point to some deficiencies in the tokenization
process, as a lot of different words appear in very few documents. This point will
be checked against the word lists obtained from the tokenizer.
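
The mechanism behind the second observation can be reproduced on a toy collection. The sketch assumes a hypothetical stem table in place of a real stemmer and an arbitrary minimum threshold of 25%; the data are invented for illustration and are unrelated to the test collection.

```python
from collections import Counter

def document_frequency(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def keywords(docs, min_ratio):
    """Terms whose document frequency exceeds the minimum threshold."""
    df = document_frequency(docs)
    return {w for w, n in df.items() if n / len(docs) > min_ratio}

# Hypothetical stem table standing in for a real stemmer.
STEMS = {"connected": "connect", "connecting": "connect", "connection": "connect"}

raw_docs = [["connect"], ["connected"], ["connecting"], ["connection"],
            ["alpha"], ["beta"], ["gamma"], ["delta"]]
stemmed_docs = [[STEMS.get(t, t) for t in doc] for doc in raw_docs]

raw_keywords = keywords(raw_docs, 0.25)          # every word form stays below 25%
stemmed_keywords = keywords(stemmed_docs, 0.25)  # the pooled stem reaches 50%
```

Each word form on its own occurs in one document out of eight and fails the threshold, while the shared stem pools four documents and passes it. At the same time the dictionary shrinks from eight entries to five.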

Up to now, we have taken part in the Cross-Language Evaluation Forum (CLEF) 2003
[24] with some of the techniques that we plan to improve in the OmniPaper approach
to multilingual retrieval.

The cross-lingual IR (CLIR) problem can be reduced to traditional Information 
Retrieval via: 

1. Translation of the query into the target language of the documents. 

2. Translation of the documents into the source language of the query. 

3. Mapping of queries and documents into an intermediate (language-independent) 
indexing space. 

The key factor in preserving the accuracy of monolingual retrieval is finding an
optimal mapping of terms from one language to another. Proposed techniques are
usually evaluated at the CLEF conferences.

The Cross-Language Evaluation Forum is an annual conference whose main objective is
to constitute a reference framework to evaluate and compare multilingual
(cross-language) information retrieval systems and approaches. The Forum is
organized as a contest among research groups working in the information retrieval
area. The organization provides all participating groups with a document collection
including over 1.5 million documents in 9 different languages (Spanish, English,
French, German, Italian, Finnish, Swedish, Russian and Chinese) and also proposes a
set of topics: structured statements of information needs from which queries are
extracted and which are then searched in the document collection. The main goal is
to evaluate and compare the different systems by performing relevance assessments,
with the aim of creating a community of researchers and developers studying the same
problems and of facilitating collaborative initiatives between groups.

CLEF offers a series of evaluation tracks to test different aspects of cross-language
information retrieval system development: monolingual, bilingual and multilingual
information retrieval, image search (ImageCLEF), mono- and cross-language
information retrieval on structured scientific data (GIRT), interactive
cross-language information retrieval (iCLEF), multiple language question answering
systems (QA-CLEF) and cross-language spoken document retrieval (CL-SDR).

The MIRACLE (Multilingual Information RetrievAl at CLEF) team is a joint effort of
different research groups from two universities and one private company, with a
strong common interest in all aspects of information retrieval and a long-lasting
cooperation in numerous projects. Different experiments were submitted to the CLEF
2003 campaign main track, in the context of the monolingual (Spanish, English,
German and French), bilingual (from Spanish and French to English and from Italian
to Spanish) and multilingual-4 (French, English, German and Spanish languages) tasks.

Our approach combines statistical and linguistic resources. Techniques range from
automatic machine translation and strategies for query construction to relevance
feedback and topic term semantic expansion using WordNet. The main aim behind the
MIRACLE participation is to compare how these different retrieval techniques affect
retrieval performance. We also participated in the ImageCLEF track (for a
description of our work and the results obtained, see [25]).

ImageCLEF [26] is a pilot experiment first run at CLEF 2003, which consisted of
cross-language image retrieval using textual captions. A collection of nearly 30,000
black and white images from the Eurovision St Andrews Photographic Collection
[10] was provided by the task coordinators. Each image had an English caption (of
about 50 words). Sets of 50 topics in English, French, German, Italian, Spanish and
Dutch were also provided. Non-English topics were obtained as human translations of
the original English ones, which also included a narrative explanation of what should
be considered relevant for each image. The proposed experiments were designed to
retrieve the relevant images of the collection using different query languages,
therefore having to deal with monolingual and bilingual image retrieval (multilingual
retrieval was not possible as the document collection was written in only one
language).



6 Conclusions and Future Work 

This paper focuses on the description of the first prototype for Automatic Keyword
Extraction using the Vector Space Model technique and simple linguistic resources in
the OmniPaper project. A first prototype of the uploading subsystem has been
described and some relevant performance figures have been presented for the AKE
evaluation step. Our main goal is to find a hybrid approach in which document
classification is performed with general-purpose resources for several languages.

Currently we are working on the enhancement of the AKE with multiword keyword
extraction based on the n-gram approach, heuristics for proper name identification
and lexical-semantic resources for entity recognition. We are also analysing the
feasibility of adapting or developing concrete domain ontologies using knowledge
engineering techniques and the standards promoted by the semantic web initiative
[28,29].

In the near future we will try to advance the state of the art by allowing queries
to include topics. In order to do this, semantic processing of keywords is required
[30]. This implies: first, semantic analysis of the free text in the queries
(considering number, gender, compound terms, synonyms, speech context, etc.);
second, semantic access to the database (grouping keywords into topics, relating
topics to each other, considering the scope of the topics, etc.).

Another technique that we plan to explore in the future is clustering: it can help
in the process of abstracting from document-related vectors to a more general
setting in which vectors represent concepts.

A central issue in this kind of system is query-document matching. Our first
approach has been to characterize queries by a vector in the vector space. Next, we
want to test several similarity measures and define a way to integrate or discard
some of them in order to match queries and documents more accurately, for instance
by considering several metadata fields as well as enriching queries with
semantically related terms, drawing on the authors’ previous experience.

Regarding the matching, arguably one of the most promising techniques in this
context is fuzzy logic. The vector space model seems quite adequate for the
application of such a technique, where the vector components could be regarded as
the arguments of a logical predicate (which represents a document or a user query).

Moreover, the coupling of clustering and fuzzy logic could perform very well:
clustering can yield fewer vectors and thus better indexing, and fuzzy logic-based
search can then broaden the focus of the search. Nevertheless, this point deserves
further investigation.



Abstract. As more and more knowledge and information become available
through computers, a critical capability of systems supporting knowledge
management is the classification of documents into categories that are meaningful
to the user. In a step beyond the use of keywords, we developed a system that
analyzes the sentences contained in unstructured or semi-structured documents,
and utilizes an ontology reflecting the domain knowledge for a semantic
classification of the documents. An experimental system has been implemented for
the analysis of small documents in combination with a limited ontology; an
extension to larger sets of documents and extended ontologies, together with an
application to practical tasks, is the focus of ongoing work.



1 Introduction 

With the volume of knowledge and information available to computer users increasing
at an ever-accelerating rate, the need for an effective mechanism to organize not
only information but also knowledge becomes critically important. We distinguish
knowledge from information and define knowledge as a “fluid mix of framed
experience, values, contextual information and expert insight that provides a
framework for evaluating and incorporating new experiences and information” [9].

Information retrieval techniques such as document clustering have frequently been
employed to support the organization and retrieval of information [1]. Document
clustering is essentially an unsupervised process in which a large collection of
text documents is organized into groups of related documents, without depending on
external knowledge [10]. A potential problem with data-driven clustering algorithms
is their inability to correctly identify cases in which different words are used to
describe the same concept. This is due to the similarity-based measure adopted in
the algorithm. Furthermore, without including the user context, information is more
often than not organized according to the fixed viewpoint of the conventional
clustering methods, rather than reflecting the interests of the user [1]. This
ultimately reduces the usefulness of the information. Typically, such retrieval
techniques leave a significant portion of the utilization of the knowledge contained
in the retrieved documents to the user: the techniques are only used to calculate a
ranking of the documents, attempting to identify the ones that are most relevant to
the user.

The core principle of our approach is our belief that, to be useful, information has
to be organized in a manner that is intuitive and relevant to the user. A context
model is the ontology of an application domain, which defines the meanings of the
vocabularies used within the ontology according to the user perspective. The context
model attempts to emulate the mental processes of perception and categorization. We
have developed the Ontology-based Semantic Classification (OSC) framework,
leveraging natural language processing techniques and ontologies to incorporate the
user’s current context into the categorization of information. Figure 1 illustrates
the overall process in which unstructured documents are categorized according to the
user perspective. In Section 2, we discuss the usage of ontology in the OSC
framework. Section 3 presents the various components employed. In Section 4, we show
the implementation as well as the results of the experiments performed. In Section
5, we summarize our findings and future endeavors.



Fig. 1. Overall classification process [figure labels: Unstructured Documents,
Natural Language Processing, Categorization, High value, Low value]



2 Ontology 

An ontology can be defined as a specification of a representational vocabulary for a
shared domain of discourse, which may include definitions of classes, relations,
functions and other objects [2]. An ontology includes a selection of specific sets
of vocabulary for domain knowledge model construction, and the context of each
vocabulary is represented and constrained by the ontology. Therefore, an ontological
model can effectively disambiguate the meanings of words in free text sentences,
overcoming the problem faced in natural language where a word may have multiple
meanings depending on the applicable context [3].

Vocabularies used in an ontology are of two kinds: 1) a direct subset of a natural
language (e.g., "entity", "tree", and "basketball"), and 2) user-created ‘words’ that
do not exist in a natural language (e.g., "ALLFRD"). Depending on the construction of
the ontology, the meaning of those words in the ontology could remain the same as in
natural language, or differ completely. The meaning of ontological terms that are not
derived directly from a natural language can still be captured by a natural language.
For example, the word “COM” used in a specific ontology means “Common Object
Model” in English.

From an engineering perspective, ontologies can be very helpful for the reuse of
domain knowledge, and for the separation of domain knowledge from the software code
that performs operations on that knowledge. We have adopted ontologies as the link
that incorporates user-specific context into the categorization process within the
framework. Essentially, within the Ontology-based Semantic Classification (OSC)
framework, the context model is represented by the signature and category
ontologies. The signature ontology is a logical grouping of keywords having the same
meaning. For example, the signature SEARCH contains the keywords "A*", "Depth First"
and "Breadth First". The signature ontology is used in the Context-based Free Text
Interpreter (CFTI) to extract signatures from unstructured documents. This is the
first stage of the categorization process. The category ontology is a higher level
of logical grouping used in the next stage. A category contains signatures with the
same context. For example, the category AI is a logical grouping of the signatures
SEARCH and AGENT. The Context-based Categorization Agent (CCA) employs the category
ontology to categorize the signature instances extracted from the unstructured
documents.
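
The two-level grouping can be illustrated with a minimal sketch built around the SEARCH and AI examples given above. The keyword list for AGENT is hypothetical, and plain substring matching stands in for the sentence analysis performed by CFTI; the function names are likewise assumptions.

```python
# Signature ontology: each signature groups keywords with the same meaning.
SIGNATURES = {
    "SEARCH": {"a*", "depth first", "breadth first"},  # keywords from the text
    "AGENT": {"agent"},                                # hypothetical keyword list
}

# Category ontology: each category groups signatures with the same context.
CATEGORIES = {"AI": {"SEARCH", "AGENT"}}

def extract_signatures(text):
    """First stage: find the signatures whose keywords occur in the text."""
    lowered = text.lower()
    return {sig for sig, kws in SIGNATURES.items() if any(k in lowered for k in kws)}

def categorize(signatures):
    """Second stage: map extracted signatures to the categories containing them."""
    return {cat for cat, sigs in CATEGORIES.items() if sigs & signatures}

sigs = extract_signatures("We compare Depth First and Breadth First strategies.")
cats = categorize(sigs)
```

A document that mentions "Depth First" is thus assigned the signature SEARCH in the first stage and the category AI in the second, without the word "AI" appearing in the text.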



3 Semantic Classification 

Linguistically, humans combine their understanding of relatively small textual units
in order to understand larger textual units, guided by syntactic and semantic rules
[11][12]. Syntax relates to the arrangement of words, and semantics to their meaning.
Similarly, a natural language processing system must be able to address the
syntactic and semantic aspects of natural language [3]. Subsequently, to perform
useful classification, the categorization must be based on the actual information
content, or an explicit representation of the information content, of the source
documents. In this section, we introduce two existing language tools (i.e., Link
Grammar Parser and WordNet), and the design of the Context-based Free Text
Interpreter (CFTI) and the Context-based Categorization Agent (CCA).



3.1 Syntactic Analysis 

Natural language syntax affects the meaning of words and sentences. The very same
words can have different meanings when arranged differently. For example: “a
woman, without her man, is nothing” and “a woman: without her, man is nothing”
(http://www.p6c.com/joke of the week.html). The Link Grammar Parser, developed at
Carnegie Mellon University [4], assigns to a given sentence a valid syntactic
structure, which consists of a set of labeled links connecting pairs of words. It
utilizes a dictionary of approximately 60,000 word forms, which covers a significant
variety of syntactic constructions, including many considered rare or idiomatic. The
parser is robust; it can disregard unrecognizable portions of sentences and assign
structures to the recognized portions. It is able to guess intelligently, from
context and spelling, the probable syntactic categories of unknown words. It has
knowledge of capitalization, numeric expressions, and a variety of punctuation
symbols.

The basis of the theory of Link Grammar is planarity, described by [5] as a
phenomenon evident in most sentences of most natural languages. To represent a
sentence, arcs are drawn connecting words with specified relationships within the
sentence. These arcs do not cross in syntactically correct sentences. Planarity is
defined in Link Grammar as “the links are drawn above the sentence and do not cross”
[4]. To visualize link grammars, think of words as blocks with connectors coming
out. There are different types of connectors; connectors may also point to the right
or to the left.



Fig. 2. Each word is a block with connectors [6]




Fig. 3. A valid sentence contains blocks connected without a cross [6]



Fig. 4. An invalid sentence contains blocks connected with crosses [6]



A sentence is valid if all the words present are used according to their rules, and cer- 
tain global rules are satisfied [6]. Each word is a block with connectors (see Figure 2). 

Each intricately shaped, labeled box is a connector. A connector is ‘satisfied’ when 
‘plugged into’ a compatible connector (as indicated by shape). A valid sentence is one 
in which all blocks are connected without a crossing. An example of a valid sentence 
is “the cat chased a snake” (Figure 3). An example of an invalid sentence is “the Mary
chased cat”, which contains a cross (Figure 4). 
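
The no-crossing condition can be illustrated with a minimal sketch. It assumes a linkage is given as pairs of word positions; the function names and the example links are hypothetical and are not taken from the Link Grammar Parser itself.

```python
from itertools import combinations

def links_cross(a, b):
    """Return True if two links, drawn as arcs above the sentence, cross."""
    (i, j), (k, l) = sorted(a), sorted(b)
    # Two arcs cross when exactly one endpoint of one lies strictly inside the other.
    return i < k < j < l or k < i < l < j

def is_planar(links):
    """A linkage is planar when no pair of its links crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))

# Word positions for "the cat chased a snake": the=0, cat=1, chased=2, a=3, snake=4.
# Assumed links: the-cat, cat-chased, chased-snake, a-snake.
valid_linkage = [(0, 1), (1, 2), (2, 4), (3, 4)]

# A hypothetical linkage in which two arcs cross.
crossing_linkage = [(0, 2), (1, 3)]

print(is_planar(valid_linkage))
print(is_planar(crossing_linkage))
```

Planarity is only one of the global rules; a full validity check would also have to verify that every connector of every word is satisfied.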

The Link Grammar Parser identifies all valid linkages within a free text input, and
outputs them as a grammatical tree. For example, an input such as “The brown fox
jumped over that lazy dog” would result in the output shown in Figure 5: 

[Linkage diagram: arcs labeled Ds, A, Ss, MVp, Dsu and A drawn above the tagged words]

the brown.a fox.n jumped.v over that.d lazy.a dog.n

Constituent tree:

(S (NP The brown fox)
   (VP jumped
       (PP over
           (NP that lazy dog)))
   .)

Fig. 5. An output produced by the Link Grammar Parser



3.2 Semantic Knowledge

Two types of semantic knowledge are essential in a natural language processing system:
lexical knowledge about words independent of context (e.g., “children” as the
plural form of “child”, and the synonym relationship between “helicopter” and
“whirlybird”) and contextual knowledge (i.e., how meanings are refined when used in
a specified context). In the Context-based Free Text Interpreter (CFTI), lexical
knowledge is acquired through integration of the system with the WordNet database, and
contextual knowledge is acquired by tracking contextual meanings of words and
phrases during and after development of an ontology (i.e., context model).

WordNet, an electronic lexical database, is considered to be the most important
resource available to researchers in computational linguistics, text analysis, and many
related areas [7]. WordNet has been under development since 1985 by the Cognitive
Science Laboratory at Princeton University under the direction of Professor George
A. Miller. Its design is “...inspired by current psycholinguistic theories of human
lexical memory. English nouns, verbs, and adjectives are organized into synonym
sets, each representing one underlying lexical concept. Different relations link the
synonym sets.” [8]



[Synset ID: 100008019, “a living organism characterized by voluntary movement”]
    -- Type Of -->
[Synset ID: 100002086, “any living entity”]

Fig. 6. Two synsets with a ‘type-of’ relationship





The most basic semantic relationship in WordNet is synonymy. Sets of synonyms,
referred to as synsets, form the basic building blocks. Each synset has a unique
identifier (ID), a specific definition, and relationships (e.g., inheritance, composition,
entailment, etc.) with other synsets. Two synsets with a “type-of” relationship are shown
in Figure 6. The first synset has an ID “100008019”, a definition of “a living
organism characterized by voluntary movement”, and contains six individual words (e.g.,
“animal”, “animate being”, etc.). The second synset has an ID “100002086”, a
definition of “any living entity”, and contains three words (“life form”, “organism”,
and “being”). The first synset is a “type-of” the second synset. WordNet contains a
significant amount of information about the English language. It provides meanings
of individual words (as does a traditional dictionary), and also provides relationships
among words. The latter is particularly useful in linguistic computing.
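
As a minimal sketch of this structure, the two synsets described above can be held in a small in-memory index. The `Synset` class and the `is_type_of` helper are hypothetical names introduced for illustration; they are not the WordNet API, and only the words quoted in the text are listed.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    synset_id: str
    definition: str
    words: list
    type_of: list = field(default_factory=list)  # IDs of the more general synsets

organism = Synset("100002086", "any living entity",
                  ["life form", "organism", "being"])
animal = Synset("100008019",
                "a living organism characterized by voluntary movement",
                ["animal", "animate being"],  # only the words named in the text
                type_of=["100002086"])

index = {s.synset_id: s for s in (organism, animal)}

def is_type_of(child_id, ancestor_id, index):
    """Follow 'type-of' links transitively from child to ancestor."""
    stack = list(index[child_id].type_of)
    seen = set()
    while stack:
        sid = stack.pop()
        if sid == ancestor_id:
            return True
        if sid not in seen and sid in index:
            seen.add(sid)
            stack.extend(index[sid].type_of)
    return False

print(is_type_of("100008019", "100002086", index))
```

The relation is directional: the lookup succeeds from the “animal” synset to the “organism” synset, but not the other way round.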

While WordNet links words and concepts through a variety of semantic relation- 
ships based on similarity and contrast, it “does not give any information about the 
context in which the word forms and senses occur” [7]. In Context-based Free Text 
Interpreter (CFTI), refinement of word meanings in specific contexts (i.e., contextual 
knowledge) is accomplished by mapping relationships between natural language and 
a context model represented by the signature ontology. In practice, the tracking of 
mapped relationships between a natural language sentence and a context model is a 
process of interpretation of the model (i.e., what a model really means) through the 
use of a natural language. From the perspective of a natural language processing sys- 
tem, which employs appropriate lexical and contextual knowledge, the interpretation 
of a free text sentence is a process of mapping the sentence from natural language 
through a context model (Figure 7). 

Fig. 7. Mapping from natural language to signature instances through signature ontology
(context model)

Different context models may produce different results simply because words can
have different meanings in different contexts. In Context-based Free Text Interpreter 
(CFTI), the representation of meaning is accomplished by manipulations of a context 
model (i.e., creation, modification, and deletion of objects and relationships in the 
signature ontology). For example, a hazard detection system receives a free text sen- 
tence “House 303 is on fire!”. If the system is able to model this information correctly 
(i.e., locate the instance of the house in the model and set its attribute to “on fire”), 
then it is assumed that the system understands the meaning of the sentence [3]. 



3.3 Context-Based Free Text Interpreter (CFTI) Design 

CFTI leverages the Link Grammar capability for syntactic analysis of a sentence.
At the same time, the lexical meaning analysis of a sentence is supported through
integration with the WordNet database [3]. The tasks performed by CFTI are
summarized as follows:

1. Analyze the syntactic structure of the sentence.
2. Analyze the lexical meaning of the words in the sentence.
3. Refine the meanings of the words through the application of a signature ontology
   (context model).
4. Represent the meaning of the sentence in signature instances.

Figure 8 illustrates the processing of a free text message by the CFTI system and the
subsequent representation in the signature instances.

Even though the CFTI requires an ontological model for the acquisition of contex- 
tual knowledge and the representation of meanings, the system is not constrained by 
any particular knowledge domain. A system change from one ontological model to 
another does not require significant system reconfigurations. 




Fig. 8. From free text messages to signature instances 


3.4 Context-Based Categorization Agent (CCA) Design

The signature instances produced by the Context-based Free Text Interpreter (CFTI)
correlate the content of the unstructured documents with the context of the user. A
signature is defined as a logical grouping of keywords having the same context from the
user perspective. For example, from the user perspective, keywords such as “A*”,
“Depth First” and “Breadth First”, which share the same context, can be grouped under
the signature SEARCH. CCA offers flexibility and scalability by providing a higher level
of grouping: signatures with the same context are grouped in the same category. CCA
relies on the category ontology to incorporate the user perspective. The category
ontology specifies how the signatures are grouped with reference to the user context. For
example, if the signatures SEARCH and GAME share the same context, they can be
grouped in the same category. In addition, the category ontology supports a class
hierarchy similar to the object-oriented paradigm, in which parent-child relationships exist.
A key feature of such an approach is the capability to adapt to changes dynamically,
without recompilation of the CCA. This is especially important since changes have to
be made to the category ontology frequently to reflect changes in the user
perspective, which evolves as the volume of information increases. For example, the
category ontology is modified as new categories are created or new signatures are
added to existing categories.




Fig. 9. From signature instances to category instances through category ontology 



CCA provides the flexibility and scalability to adapt to these changes without any
recompilation. The tasks performed by CCA are:

1. Interface with the signature instances.
2. Interface with the category ontology.
3. Classify the signature instances through the application of the category ontology.
4. Represent the classification of the documents as category instances.

Figure 9 illustrates the classification of the signature instances by the CCA and the
subsequent representation.
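
The two levels of grouping (keywords into signatures, signatures into categories) can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the actual CCA is rule-based, and the GAME keywords, the category names and both function names below are hypothetical.

```python
def find_signatures(text, signature_ontology):
    """Return the signatures whose keywords occur in the text."""
    lowered = text.lower()
    return {name for name, keywords in signature_ontology.items()
            if any(keyword in lowered for keyword in keywords)}

def find_categories(signatures, category_ontology):
    """Return the categories that group at least one of the given signatures."""
    return {category for category, members in category_ontology.items()
            if signatures & members}

signature_ontology = {
    "SEARCH": {"a*", "depth first", "breadth first"},  # keywords from the text
    "GAME": {"chess", "checkers"},                     # hypothetical keywords
}
category_ontology = {"Problem Solving": {"SEARCH", "GAME"}}  # hypothetical category

document = "We compare depth first and breadth first strategies."
signatures = find_signatures(document, signature_ontology)
categories = find_categories(signatures, category_ontology)
print(signatures, categories)

# The ontology is data, so it can be extended at run time without touching the code.
category_ontology["Board Games"] = {"GAME"}
print(find_categories({"GAME"}, category_ontology))
```

Keeping both ontologies as data passed in at call time mirrors the design goal stated above: a new category or signature changes the ontology, not the classifier.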


4 Implementation



This section explains a prototypical implementation of an ontology-based system for 
the semantic classification of unstructured documents. We demonstrate the feasibility 
of incorporating user context for the task of classifying unstructured documents. 



4.1 Ontology-Based Semantic Classification (OSC) Framework 

The core design principle of the OSC framework is to provide loosely coupled yet 
seamlessly integrated components. To achieve this, the OSC framework architecture 
is decomposed into three distinct layers and the interfaces between the components 
are specified in a language neutral format (e.g. via XML), as shown in Figure 10. 




Fig. 10. System Architecture of the OSC Framework 

The core engine layer encompasses components that contribute to the core
functionality of the framework. It includes the Context-based Free Text Interpreter (CFTI)
and Context-based Categorization Agent (CCA). CFTI is implemented through the 
use of CLIPS 6.20. CFTI contains five components: Link Grammar, Lisp Simulator, 
WordNet, a mapping engine, and a signature ontology (context model). The Link 
Grammar and Lisp Simulator process syntactic knowledge; WordNet provides lexical 
knowledge about words; the mapping engine is composed of CLIPS rules for meaning 
extraction from free text sentences; and the context model provides contextual knowl- 
edge about words and representation of meanings of free text sentences [3]. While a 
context model is required by the system, a change from one context model to another 
does not require significant system reconfiguration. CCA was developed in CLIPS 
6.20. It includes two components: a classification engine and a category ontology. 
The classification engine is powered by a network of rules that categorizes the signa- 
ture instances with respect to the interest of the user as specified in the category on- 
tology. The category ontology can be extended dynamically to allow changes without 
recompiling the system. 

The physical storage layer handles the storing of the context model represented by 
the signature and category ontologies, signature and category instances and the un- 
structured documents. The interface between the physical layer and the rest of the 
components is confined to a language neutral format such as XML, ensuring the loose 
coupling between the different layers. Applications that have to interact with the 
physical layer can be written in any programming language as long as that language 
supports XML. On the other hand, the signature and category ontologies (context 
model), signature and category instances and the unstructured documents can be 
stored in text file format, binary file format, relational database and object-oriented 
database. 

The application layer is a logical grouping of components that capitalize on the 
category instances. By design of the OSC framework, application components can be 
plugged into the framework as and when they are ready. A possible application com- 
ponent is the search engine. The search engine allows the user to retrieve relevant 
information from the unstructured documents using the category ontology as search 
criteria. 



4.2 Experiment Results 

The main purpose of the experiment is to verify that the Ontology-based Semantic
Classification (OSC) framework functions as a whole. This includes assessing how
well the overall system performs when the individual components are integrated
through the common interfaces. A secondary objective of the experiment is to
validate the usefulness of the proposed context model in emulating the human
categorization process. For this experiment, we chose 33 unstructured documents
from the American Association for Artificial Intelligence web site
(www.aaai.org) in various categories. Each document was subsequently converted to
text format.

The experiment was performed in two stages. In the first stage, a human operator
(a domain expert) was asked to manually categorize the documents into five distinct
categories: Agent, Games, Data Mining, Natural Language Processing and Search.
There was no restriction on the number of categories to which a document could be
assigned. In the second stage, the OSC framework categorized the same collection of
documents. This involved a knowledge engineer interviewing the domain expert to
capture and represent his context in the context model used in the OSC framework.
The results of the experiment are tabulated in Table 1.



Table 1. Experiment Result

The table shows that the system correctly identified most categories for the
documents, but in many cases it selected additional categories. In the terminology of
information retrieval, recall was excellent, but precision was weaker. The key emphasis
of this experiment, however, was not so much on the accuracy of the OSC
framework, but rather on showing that the respective components can be integrated and
work seamlessly as intended. The OSC framework indeed performed as intended
throughout the experiment, categorizing documents with respect to the context
specified in the ontology. Further experiments with larger data sets and more finely tuned
parameters are in progress.
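
The relation between recall and precision for multi-category assignments can be made concrete with a short sketch. The two documents and their category sets below are hypothetical and are not the values of Table 1; they only reproduce the pattern described above, in which every expected category is found but extra categories are also selected.

```python
def precision_recall(expected, predicted):
    """Micro-averaged precision and recall over per-document category sets."""
    true_positives = sum(len(expected[d] & predicted.get(d, set())) for d in expected)
    n_predicted = sum(len(predicted.get(d, set())) for d in expected)
    n_expected = sum(len(expected[d]) for d in expected)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_expected if n_expected else 0.0
    return precision, recall

# Hypothetical judgments of the domain expert.
expected = {"doc1": {"Agent"}, "doc2": {"Search", "Games"}}
# Hypothetical system output: all expected categories plus some extra ones.
predicted = {"doc1": {"Agent", "Games"},
             "doc2": {"Search", "Games", "Data Mining"}}

precision, recall = precision_recall(expected, predicted)
print(precision, recall)
```

In this example all three expected categories are recovered (recall 1.0), while only three of the five predicted categories are correct (precision 0.6).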


5 Conclusion

In this paper, we have shown how to utilize user context and preferences through 
ontology in order to classify unstructured documents into useful categories. We have 
demonstrated the use of a Context-based Free Text Interpreter (CFTI), which per- 
forms syntactical analysis and lexical semantic processing of sentences, to derive a 
description of the content of the unstructured document. Direct and indirect mapping 
relationships exist among vocabularies used by ontologies and vocabularies used by 
natural languages. The capture and utilization of these relationships is key to the de- 
velopment of natural language processing systems. 

The quality of classification of unstructured documents is strongly dependent on the
quality of context models and the accuracy of the interpretation of natural language.
The Ontology-based Semantic Classification (OSC) framework has been tested with a 
relatively small context model. While an assumption that the system would perform 
similarly when tested with larger-sized models seems valid, conducting such tests is 
the focus of ongoing work, together with the use of the OSC in practical applications. 



Abstract. Extraction of visual descriptors is a crucial problem for
state-of-the-art visual information analysis. In this paper, we present a
knowledge-based approach for detection of visual objects in video sequences,
extraction of visual descriptors and matching with pre-defined objects.
The proposed approach models objects through their visual descriptors
defined in MPEG-7. It first extracts moving regions using an efficient
active contours technique. It then computes visual descriptions of the 
moving regions including color, motion and shape features that are in- 
variant to affine transformations. The extracted features are matched to 
a-priori knowledge about the objects’ descriptions, using appropriately 
defined matching functions. Results are presented which illustrate the 
theoretical developments. 



1 Introduction 

An Information Retrieval System (IRS) consists of a database containing a
number of documents, an index that associates each document with its related terms,
and a matching mechanism that maps the user’s query (consisting of terms)
to a set of associated documents [1]. In the case of multimedia documents, the
content of the document cannot be directly used by the user of the IRS in the
query, since matching of multimedia content is not as simple as matching of
textual terms, and features of the content must be used instead. The need for
description of multimedia documents’ content has been addressed by MPEG-7,
the ISO standard for description of multimedia content [10]. A large number of
MPEG-7 compliant multimedia descriptions are currently being produced. The
standard defines three kinds of features that comprise the description:
Creation and Usage Information, Structural Information and Semantic
Information. The first regards mostly textual information, commonly known
as metadata. Structural information expresses a low-level, machine-oriented
kind of description, since it describes content in the form of signal segments
and their properties. Semantic information, on the other hand, expresses a
high-level, conceptual and human-oriented kind of description, since it deals with
semantic entities, such as objects and events.

Fig. 1. The proposed integrated scheme for object recognition, using the shape and
color descriptors.


In this paper we focus on a specific task of multimedia content description, i.e.,
the detection and recognition of objects present in a video stream, whose
dominant characteristic is their motion. The extraction of moving objects in
video streams and their description with the use of low-level feature matching is
a task that emerges in various applications in the field of video understanding,
such as content-based retrieval and semantic description of events. This work
constitutes an integration of three steps for object recognition, revisiting and
improving existing methods found in the literature. The three steps being followed
are illustrated in Fig. 1 and can be briefly described as follows. The moving
objects of interest are extracted with the use of a tracking method proposed in
[12], which utilizes an active contour (modified Snake) model and the motion
information obtained by a motion estimation scheme. Once the desired objects
are extracted, i.e., their position and contour are estimated for each frame of
the sequence, color descriptors are extracted and their shape is appropriately
modelled and transformed, so that it becomes affine invariant. The final step
of the overall scheme is the matching of the color and shape descriptors with
the respective ones of known objects, existing in a database. In the experiments
presented in this paper, we use three different objects of either the same color
or the same shape, to verify the performance of the proposed integrated scheme
in ground-truth examples. More complicated examples of object recognition are
currently being tested, with the use of a database and an efficient searching
procedure in terms of complexity. Finally, for more sophisticated applications
such as the semantic description of events, the motion trajectory of the desired
objects is to be utilized, in order to obtain further useful information about
the objects’ global motion, apart from their instant motion, provided by motion
estimation schemes.

2 Moving Object Extraction 

Efficient moving object extraction in real-world conditions is a challenging task 
for the researchers in the fields of computer vision and video processing. In 
modern coding standards, like MPEG-4 and MPEG-7, the term ’video objects’ 
is used to define moving objects in a video sequence. Automatic extraction of 
such objects is by no means trivial, and occlusion is one of the most important
problems. In this paper we implement and extend the work presented in [12] 
for object tracking, in order to support highly textured backgrounds and partial 
occlusion of the moving objects. 

In [12] object tracking is performed utilizing a snake model [8] and the motion 
information obtained in previous time instances, or motion history. Regarding 
the proposed snake model, its internal energy is defined in terms of the local 
curvature and elasticity (distances between neighboring points), whereas the
external energy term is defined with the use of a modified image gradient, replacing
the commonly used term $|\nabla G_\sigma * I|$ [7], which introduces noise in the snake
models. More information about the definitions of the proposed energy terms can be
found in [12]. 

Before applying the tracking model in the current frame of a sequence, as 
described in the following, we pre-process the image to eliminate noise, with the 
use of an appropriate morphological Alternating Sequential Filter (ASF) [12, 
9]. The modified image gradient used for our purposes is actually a part of the 
Watershed transformation in image segmentation problems [9] and consists of 
the extraction of binary image markers through a morphological geodesic erosion 
reconstruction of the image gradient, and successive morphological conditional 
erosions of these markers, so that they constitute the only local minima of the 
image gradient. 

2.1 Motion Estimates Extraction 

The correct extraction of moving edges in terms of position and direction is im- 
portant and aids the accurate estimation of an object’s position from the current 
to the next frame. Several existing techniques are able to adequately cope with 
the difficult problem of optical flow recovery given that their assumptions hold. 
The challenge is to achieve high robustness against strong assumption violations 
commonly met in real sequences. We adopt the motion estimation technique 
proposed by Black et al. [4] as an efficient tool for overcoming these violations. 
They reformulate the objective function, which consists of the optical flow
equation and the spatial coherence constraint, in order to include the robust
statistics tools [6] in an almost straightforward way. They simply take the standard
least-squares formulation of optical flow and use a robust estimator instead of
the quadratic one. This approximation is then minimized using a coarse-to-fine
(multiresolution) simultaneous over-relaxation technique. The proposed
reformulation results in an area-based regression technique that is robust to multiple
motions due to occlusion, transparency or specular reflections and compensates
for over-smoothing and noise sensitivity.

Fig. 2. Tracking method in steps: (a) object contour in the previous frame, (b) snake
initialization in the current frame, (c) uncertainty region, (d) object contour in the
current frame.
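
The effect of replacing the quadratic penalty of Section 2.1 with a robust estimator can be sketched as follows. The text does not name the estimator, so the Lorentzian function used here is an assumption, chosen because it is a common choice in robust optical flow; the function names and the scale value are illustrative only.

```python
import math

def quadratic_rho(r):
    """Least-squares penalty: grows without bound with the residual."""
    return r * r

def lorentzian_rho(r, sigma):
    """Lorentzian penalty (assumed estimator): grows only logarithmically."""
    return math.log(1.0 + 0.5 * (r / sigma) ** 2)

def lorentzian_psi(r, sigma):
    """Influence function (derivative of the Lorentzian penalty)."""
    return 2.0 * r / (2.0 * sigma ** 2 + r ** 2)

sigma = 1.0
for residual in (0.5, 1.0, 10.0, 100.0):
    print(residual,
          quadratic_rho(residual),
          round(lorentzian_rho(residual, sigma), 3),
          round(lorentzian_psi(residual, sigma), 4))
```

The influence of a residual first rises and then falls towards zero, so measurements that violate the flow assumptions (for example at occlusion boundaries) contribute little to the estimate, whereas under the quadratic penalty their influence keeps growing.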

2.2 Object Tracking 

Given the proposed snake model presented in [12], the first step is to extract 
some regions (a narrow band) around the curve, which are described as uncer- 
tainty regions (Fig. 2). This is achieved by exploiting the motion history of the 
tracked contour (curve points’ motion in previous time instances), estimated 
with the use of the motion estimation scheme proposed in subsection 2.1: the 
previously estimated contour (Fig. 2(a)) is deformed according to the previously 
estimated motion (snake initialization) (Fig. 2(b)) and the standard deviation 
of each point’s mean motion is calculated; the uncertainty region around each 
point is then the region in the normal direction to the snake initialization, whose 
width is defined according to the corresponding standard deviation (Fig. 2(c)). 
The next step is to find the new position of each point of the curve, inside 
its corresponding uncertainty region (Fig. 2(d)): instead of following an energy 
minimization procedure, using a dynamic programming algorithm, we adopt a 
force-based approach, which reduces the computational cost but also avoids the 
point correspondence problem between different time instances. 

According to that approach, energy terms are converted into forces and the 
final solution is obtained by minimizing the resultant force [12] inside the ex- 
tracted uncertainty regions. The internal forces deform the snake to a shape 
similar to the previously estimated object contour, whereas the external term 
forces the snake towards the object boundaries, inside the extracted overall un- 
certainty region. Thus, the energy minimization is approximated by using these 
forces, in an iterative manner similar to the steepest descent approach [5]. 

The resultant force applied to each snake point is given by the weighted 
summation of the internal and external forces. The respective weights are au- 
tomatically estimated [12], whereas their estimation accuracy is not crucial for 
the final results. The final object contour is obtained when one of the following 
criteria is satisfied: (a) if the resultant force is smaller than the one of the next
iteration, or (b) the maximum number of iterations is reached. It must be noted 
that the use of the proposed steepest descent approach does not ensure that the 
final contour corresponds to the solution of the energy minimization problem, 
but under the constraints we pose, even if the final contour corresponds to a 
local minimum, it is close to the desired solution (global minimum). 

In order to separate background and object regions, especially when the 
background contains strong edges close to the object boundaries, as well as to 
cope with moving object’s partial occlusion that may occur, we introduce two 
additional constraints that each detected edge point must obey, so that we can 
decide whether this edge belongs to the desired boundary; all candidate edges 
are indicated by the snake’s external energy, and consequently by the modified 
image gradient. 

Without loss of generality, we suppose that the background is static and
possible occluding objects are also static. If $p_k$ is a detected candidate (possible
boundary) edge pixel, and $p_i$ and $p_m$ are the neighboring pixels on both sides
of $p_k$, in the normal direction to the snake initialization (Fig. 2(b)), then (a)
$p_k$ must divide that line segment into two parts, an immiscibly moving and an
immiscibly static one, that is $u(p_i) \approx u(p_k)$ and $u(p_m) \approx 0$, and (b) $p_k$ must
be a moving point with velocity close to the mean velocity of the object region,
that is $u(p_k) \approx u_{object}$; $u(\cdot)$ and $u_{object}$ denote the instant velocity and the object
mean velocity, obtained by the motion estimation scheme described in 2.1.
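
The two constraints can be sketched as a simple predicate on velocity vectors. The tolerance `tol`, the helper names and the sample velocities are assumptions made for illustration; the paper does not specify how closeness is measured.

```python
import math

def close(u, v, tol):
    """True if two 2-D velocity vectors differ by at most tol (Euclidean norm)."""
    return math.hypot(u[0] - v[0], u[1] - v[1]) <= tol

def is_boundary_candidate(u_inner, u_edge, u_outer, u_object, tol=0.5):
    """Check constraints (a) and (b) for a candidate edge pixel.

    u_inner / u_outer: velocities of the neighbours on either side of the edge,
    along the normal to the snake initialization; u_edge: velocity at the edge.
    """
    # (a) the edge separates a moving side from a static side
    splits_motion = close(u_inner, u_edge, tol) and close(u_outer, (0.0, 0.0), tol)
    # (b) the edge moves with approximately the object's mean velocity
    moves_with_object = close(u_edge, u_object, tol)
    return splits_motion and moves_with_object

u_object = (2.0, 0.0)
print(is_boundary_candidate((2.0, 0.0), (2.1, 0.0), (0.0, 0.0), u_object))  # true boundary
print(is_boundary_candidate((0.0, 0.0), (0.0, 0.0), (0.0, 0.0), u_object))  # background edge
print(is_boundary_candidate((2.0, 0.0), (2.0, 0.0), (2.0, 0.0), u_object))  # edge inside object
```

An edge in the static background fails constraint (b), and an edge inside the moving object fails constraint (a), so only the first case is accepted as a boundary candidate.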

Thus, taking the above constraints into consideration, we overcome cases
such as (a) when the maximum is found in the background: it is not a moving one
and does not separate two immiscible (according to the motion) parts of
function $g_m$ [12], (b) when the maximum is found inside the moving object region:
although it is a moving one, it does not divide function $g_m$ into two such parts, (c)
when occlusion occurs and the maximum is on the occluding object boundary:
the maximum is not moving, although it separates the uncertainty region and
(d) when occlusion occurs and the maximum is in the occluding object region:
the maximum is neither moving nor making such a separation. In these cases,
where the two constraints are not satisfied, we ignore the external force and the
curve evolves according to its internal forces; in this way, we can obtain contours
similar to the ones in the past frames. Fig. 3 illustrates the performance of the
proposed method in a case of two moving objects: one getting partially occluded 
by a static obstacle and the other moving in front of it. The adopted motion 
estimation technique allows the utilization of the two rules described above, in 
order to separate the moving objects from the static regions (background and 
obstacle) of the scene. 

3 Visual Descriptors 

In the following, some visual descriptors, which have been introduced in the
integrated scheme, are briefly reviewed according to the MPEG-7 framework [10].
The Dominant Color descriptor, illustrated in the experiments of this paper, is 
presented in more detail in subsection 3.1. 

Fig. 3. (a1)-(a4) Motion estimation results and (b1)-(b4) the respective tracking results
for a case of two moving objects.



Dominant Color. The dominant color descriptor specifies a set of dominant colors 
in any arbitrary shaped region. The extraction algorithm takes as an input a set 
of color values and quantizes the image color vectors based on the Generalized 
Lloyd Algorithm (GLA), as described in Section 3.1. 

Region Contour. As contour shape descriptor an affine-invariant normalization 
of the extracted object contours is used, as described in Section 3.2. 

3.1 Color Descriptor 

The Dominant Color descriptor used in our experiments to illustrate color match- 
ing of visual objects is described in more detail below. This descriptor provides a 
compact description of the representative colors of an image or image region. Its 
main target applications are similarity retrieval in image databases and browsing 
of image databases based on single or several color values. The representative 
colors can be indexed in the 3-D color space, which allows for efficient indexing
of large databases. In its basic form, the Dominant Color descriptor consists of
the number of dominant colors $N_c$ and, for each dominant color, its value
expressed as a vector of color components $c_j$ and the percentage of pixels $p_j$ in
the image region of the corresponding cluster [10].

In order to compute this descriptor, the colors present in a given image 
or region are first clustered. Instead of the Generalized Lloyd Algorithm [10], 
the extraction procedure uses a fuzzy c-means algorithm [3] for the dominant 
color, to divide the set of pixel values corresponding to a given image region 
into clusters in the color space. The algorithm minimizes the supremum of the 
distance between the color pixel values and the representative color vectors using 
the global distortion measure J defined as 

\[
J = \sum_{j=1}^{N_c} \sum_{i=1}^{N_j} \mu_{ij} \, \lVert x_i - c_j \rVert^2 , \tag{1}
\]

where $N_c$ is the number of clusters, $N_j$ is the number of pixels of the $j$-th
cluster, $x_i$ is the $i$-th color vector, $c_j$ is the center (representative color) of the $j$-th
cluster and $\mu_{ij}$ is the degree of membership of $x_i$ in the cluster $c_j$. The
procedure is initialized with a predefined number of clusters $N_c$ whose representative
colors are computed as the centroid (center of mass) of each cluster. Then, the
algorithm follows a sequence of centroid calculation and clustering steps until
a stopping criterion (minimum distortion or maximum number of iterations) is
met.
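
A minimal sketch of dominant color extraction with fuzzy c-means is given below. It is not the implementation used in the experiments: the fuzzifier `m = 2`, the explicit initial centers, the stopping tolerance and the hard assignment used to obtain the pixel percentages are all assumptions, and the four sample pixels are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two color vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fuzzy_c_means(pixels, init_centers, m=2.0, max_iter=100, tol=1e-6):
    """Alternate membership and centroid updates until the centers stop moving."""
    centers = [list(c) for c in init_centers]
    memberships = []
    for _ in range(max_iter):
        # Membership update: closer centers receive a higher degree of membership.
        memberships = []
        for x in pixels:
            inv = [max(sq_dist(x, c), 1e-12) ** (-1.0 / (m - 1.0)) for c in centers]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Centroid update: membership-weighted mean of all color vectors.
        new_centers = []
        for j in range(len(centers)):
            w = [u[j] ** m for u in memberships]
            new_centers.append([sum(wi * x[k] for wi, x in zip(w, pixels)) / sum(w)
                                for k in range(len(pixels[0]))])
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships

def dominant_colors(pixels, init_centers):
    """Return (representative color, fraction of pixels) for each cluster."""
    centers, memberships = fuzzy_c_means(pixels, init_centers)
    counts = [0] * len(centers)
    for u in memberships:
        counts[u.index(max(u))] += 1  # hard assignment to the strongest cluster
    return [(c, n / len(pixels)) for c, n in zip(centers, counts)]

# Hypothetical region: three reddish pixels and one bluish pixel (RGB).
pixels = [(250, 10, 10), (255, 0, 0), (245, 5, 5), (10, 10, 250)]
result = dominant_colors(pixels, init_centers=[(200, 0, 0), (0, 0, 200)])
for color, fraction in result:
    print([round(v) for v in color], fraction)
```

Each returned pair corresponds to one dominant color and its pixel percentage, which are the two components of the descriptor described above.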



3.2 Shape Descriptor 

As the shape descriptor we use an affine-invariant normalization of the object
contours extracted by the tracking algorithm described in Section 2. The obtained
contours are first re-sampled so that they consist of a fixed number of equidistant
points, while preserving their original shape. In the following, we describe the
normalization method that transforms the object contours in order to make them
affine invariant, and thus appropriate for contour matching and recognition [2].



Curve Orthogonalization. The proposed procedure normalizes a curve with
respect to possible translation, skewing, and scaling, so that any remaining
affine transformation reduces to a rotation or reflection. Let
$C_i = [x_i, y_i]^T$, $i = 0, 1, \ldots, N-1$, be $N$ curve points obtained by
the tracking algorithm. A $2 \times N$ matrix notation
$C = [C_0, C_1, \ldots, C_{N-1}]$ is used to represent the points, while their
horizontal and vertical coordinates are represented by
$x = [x_0, x_1, \ldots, x_{N-1}]$ and $y = [y_0, y_1, \ldots, y_{N-1}]$. For
each curve $C$, the $(p, q)$-order moments

$$m_{pq}(C) = \frac{1}{N} \sum_{i=0}^{N-1} x_i^p \, y_i^q \qquad (2)$$

of order up to two are used for the construction of the normalized curve
$n_a(C)$. A set of linear operations (translation, scaling and rotation) on
the curve is computed during the orthogonalization procedure:



1. The center of gravity of the curve is normalized so as to coincide with the
   origin:

   $$x_1 = x - \mu_x , \quad y_1 = y - \mu_y \qquad (3)$$

   where $\mu_x = m_{10}(C)$, $\mu_y = m_{01}(C)$.

2. The curve is scaled horizontally and vertically so that its second-order
   moments become equal to one:

   $$x_2 = \sigma_x x_1 , \quad y_2 = \sigma_y y_1 \qquad (4)$$

   where $\sigma_x = 1 / \sqrt{m_{20}(C_1)}$, $\sigma_y = 1 / \sqrt{m_{02}(C_1)}$.

3. The curve is rotated counterclockwise by $\theta = \pi/4$ as follows:

   $$C_3 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \cdot C_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} x_2 - y_2 \\ x_2 + y_2 \end{bmatrix} \qquad (5)$$

4. The curve is scaled again, exactly as in step 2:

   $$x_4 = \tau_x x_3 , \quad y_4 = \tau_y y_3 \qquad (6)$$

   where $\tau_x = 1 / \sqrt{m_{20}(C_3)}$, $\tau_y = 1 / \sqrt{m_{02}(C_3)}$.



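The normalized curve $n_a(C) = C_4$ can also be written as

$$n_a(C) = N(C)\,(C - \mu(C)) = \begin{bmatrix} \tau_x & 0 \\ 0 & \tau_y \end{bmatrix} \cdot \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} \sigma_x & 0 \\ 0 & \sigma_y \end{bmatrix} \cdot \begin{bmatrix} x - \mu_x \\ y - \mu_y \end{bmatrix} \qquad (7)$$

where $\mu(C) = [m_{10}(C) \;\; m_{01}(C)]^T$ and $N(C)$ denotes the
$2 \times 2$ normalization matrix of $C$. It can be seen in [2] that for each
initial curve $C$, the normalized curve $n_a(C)$ defined in eqs. (2)-(6) has
the following properties:

$$m_{10}(n_a(C)) = m_{01}(n_a(C)) = m_{11}(n_a(C)) = 0, \quad m_{20}(n_a(C)) = m_{02}(n_a(C)) = 1 \qquad (8)$$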



The term orthogonalization is justified since these conditions are equivalent to 
n a ( C) • n a (C) T = I. Let us now consider two curves C and C' related through 
an affine transformation: 



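$$C' = A \cdot C + t \iff \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix} \qquad (9)$$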



where matrix $A$ is assumed to be of full rank. Then,
$C'_1 = C' - \mu(C') = A\,(C - \mu(C)) = A \cdot C_1$ and translation is
removed. Moreover, when a normalized curve is rotated or reflected, in which
case $A$ is orthogonal, it remains normalized. It is thus shown in [2] that
there exists an orthogonal $2 \times 2$ matrix $Q$ such that:

$$n_a(C') = Q \cdot n_a(C) \qquad (10)$$

This means that affine transformations are reduced to orthogonal ones that may
contain only rotation and/or reflection, depending on whether $\det(Q) = 1$ or
$\det(Q) = -1$. Therefore normalized curves are invariant to translation,
scaling, and skew transformations. Note that normalization is performed
without knowledge of the affine parameters $A$ and $t$, and without one-to-one
matching between curves $C$ and $C'$. In addition, the transformation
parameters ($\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, $\tau_x$, $\tau_y$)
along with $n_a(C)$ contain all information on the original curve $C$.
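
The four orthogonalization steps translate almost directly into code. The sketch below is an illustration rather than the authors' implementation; the function names `moment` and `orthogonalize` are hypothetical, and the curve is assumed to be non-degenerate (not all points on one line), so that the second-order moments are strictly positive.

```python
import math


def moment(xs, ys, p, q):
    """(p, q)-order moment of a sampled curve, as in eq. (2)."""
    return sum(x ** p * y ** q for x, y in zip(xs, ys)) / len(xs)


def orthogonalize(xs, ys):
    """Centering, scaling, rotation by pi/4 and a second scaling (eqs. 3-6)."""
    # Step 1: move the center of gravity to the origin.
    mu_x, mu_y = moment(xs, ys, 1, 0), moment(xs, ys, 0, 1)
    x1 = [x - mu_x for x in xs]
    y1 = [y - mu_y for y in ys]
    # Step 2: scale each axis so that the second-order moments equal one.
    sigma_x = 1.0 / math.sqrt(moment(x1, y1, 2, 0))
    sigma_y = 1.0 / math.sqrt(moment(x1, y1, 0, 2))
    x2 = [sigma_x * x for x in x1]
    y2 = [sigma_y * y for y in y1]
    # Step 3: rotate counterclockwise by pi/4.
    r = 1.0 / math.sqrt(2.0)
    x3 = [r * (x - y) for x, y in zip(x2, y2)]
    y3 = [r * (x + y) for x, y in zip(x2, y2)]
    # Step 4: scale again, exactly as in step 2.
    tau_x = 1.0 / math.sqrt(moment(x3, y3, 2, 0))
    tau_y = 1.0 / math.sqrt(moment(x3, y3, 0, 2))
    return [tau_x * x for x in x3], [tau_y * y for y in y3]
```

Evaluating `moment` on the output for the orders listed in eq. (8) gives a quick numerical check of the stated properties, both for a curve and for an affine-transformed copy of it.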



Starting Point and Rotation Normalization. The starting point normalization
procedure is based on the Discrete Fourier Transform (DFT) of the complex
vector $z = x + jy = [z_0 \; z_1 \ldots z_{N-1}]^T$, which is used here for
curve representation, where $z_i = x_i + j y_i$, $i = 0, 1, \ldots, N-1$,
denotes a single curve point. The DFT of the curve $z$ is given by:

$$u_k = \sum_{i=0}^{N-1} z_i \, w^{-ki} , \quad k = 0, 1, \ldots, N-1 \qquad (11)$$

where $w = e^{j 2\pi / N}$, so that $w^{lN} = 1$, $l \in \mathbb{Z}$.
Employing the primary argument, or phase, $\alpha_k = \mathrm{Arg}[u_k]$, we
construct the phase vector. Consider now a second curve
$z' = [z'_0 \; z'_1 \ldots z'_{N-1}]^T$ that is circularly shifted with
respect to $z$ by $m$ samples, where $m \in \{0, 1, \ldots, N-1\}$:

$$z' = S_m(z) = [\, z'_i = z_{(i+m) \bmod N} \mid i = 0, 1, \ldots, N-1 \,] \qquad (12)$$

In order to normalize the curve, a standard circular shift is defined using
the first and last Fourier phases:

$$p(z) = \left[ \frac{N}{4\pi} (\alpha_1 - \alpha_{N-1}) \right] \bmod \frac{N}{2} \qquad (13)$$

and the opposite shift is applied to normalize the curve:

$$n_p(z) = S_{-p(z)}(z) \qquad (14)$$

It is shown in [2] that the above normalization is invariant to starting point.
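
The role of the two Fourier phases can be checked numerically. The following sketch computes $u_1$ and $u_{N-1}$ directly from eq. (11) and evaluates the shift of eq. (13) before any rounding to a whole number of samples; the function names are hypothetical, and how [2] rounds the shift or resolves the residual ambiguity of $N/2$ samples is not covered here.

```python
import cmath
import math


def dft_coefficient(z, k):
    """u_k = sum_i z_i * w**(-k*i) with w = exp(j*2*pi/N), as in eq. (11)."""
    n = len(z)
    return sum(zi * cmath.exp(-2j * math.pi * k * i / n) for i, zi in enumerate(z))


def circular_shift(z, m):
    """S_m(z): element i of the result is z[(i + m) mod N], as in eq. (12)."""
    n = len(z)
    return [z[(i + m) % n] for i in range(n)]


def standard_shift(z):
    """p(z) of eq. (13), left unrounded (an assumption of this sketch)."""
    n = len(z)
    alpha_first = cmath.phase(dft_coefficient(z, 1))
    alpha_last = cmath.phase(dft_coefficient(z, n - 1))
    return (n / (4.0 * math.pi) * (alpha_first - alpha_last)) % (n / 2.0)
```

Shifting a curve by $m$ samples multiplies $u_k$ by $w^{km}$, so `standard_shift` of the shifted curve should exceed that of the original by $m$ modulo $N/2$; applying the opposite shift then brings both curves to a common starting point, up to that ambiguity.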

Rotation normalization is achieved by setting the phases of $u_1$ and
$u_{N-1}$ to zero, so that they become real and positive. Assume that two
curves $C$ and $C'$ have been orthogonalized and normalized with respect to
their starting point, thus satisfying eq. (8). We then uniquely decompose
matrix $Q$ as

$$Q = \begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \qquad (15)$$

where $\theta \in [0, \pi)$, $s_x = \pm 1$ and $s_y = \pm 1$, in order to
denote a one-to-one relation between rotation/reflection parameters and
elements of $Q$. Adopting the complex vector notation $z$, $z'$,

$$z' = (s_x x + j s_y y)\, e^{j\theta} \qquad (16)$$

The rotation of curve $z$ is normalized according to the average value of the
Fourier phases $\alpha_1$ and $\alpha_{N-1}$:

$$r(z) = \left[ \tfrac{1}{2} (\alpha_1 + \alpha_{N-1}) \right] \bmod \pi \qquad (17)$$

$$z_1 = z \cdot e^{-j r(z)} \qquad (18)$$

Horizontal and vertical reflection is normalized according to the third-order
moments of $z_1$:

$$v(z_1) = v_x(z_1) + j v_y(z_1) = \mathrm{sgn}[m_{12}(z_1)] + j \cdot \mathrm{sgn}[m_{21}(z_1)] \qquad (19)$$

$$n_r(z_1) = z_2 = v_x(z_1)\, x_1 + j \cdot v_y(z_1)\, y_1 \qquad (20)$$

where $\mathrm{sgn}[\cdot]$ denotes the signum function. It is then proved in
eq. (3) that $n_r(C)$ is invariant to rotation and reflection transformations:

$$n_r(z') = n_r(z) \qquad (21)$$

As in curve orthogonalization, the set of parameters $r(z)$, $v_x(z)$,
$v_y(z)$ together with $n_r(z)$ contain all information about the original
curve $z$. Combining all the above results, it is proved that the curve
$n_r(n_p(n_a(C)))$ obtained by the entire normalization procedure is invariant
to any affine transformation.
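
As a small numerical illustration of eqs. (17) and (18) only, the sketch below removes the rotation of a complex curve using the average of the two Fourier phases. The function name `rotation_normalize` is hypothetical, and the reflection step of eqs. (19) and (20) is deliberately left out.

```python
import cmath
import math


def rotation_normalize(z):
    """Rotate z by -r(z), r(z) being the phase average of eq. (17) modulo pi."""
    n = len(z)
    # u_1 and u_{N-1} computed directly from the DFT definition of eq. (11).
    u_first = sum(zi * cmath.exp(-2j * math.pi * i / n) for i, zi in enumerate(z))
    u_last = sum(zi * cmath.exp(-2j * math.pi * (n - 1) * i / n) for i, zi in enumerate(z))
    r = (0.5 * (cmath.phase(u_first) + cmath.phase(u_last))) % math.pi
    return [zi * cmath.exp(-1j * r) for zi in z]
```

Because $r(z)$ is taken modulo $\pi$, two rotated copies of the same curve agree after this step only up to a factor of $\pm 1$; a rotation by $\pi$ is equivalent to $s_x = s_y = -1$, which is presumably what the sign normalization of eqs. (19) and (20) is meant to absorb.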
4 Object Matching

Once visual descriptors have been extracted for each detected moving object, 
these are employed to perform matching with existing objects stored in a data- 
base with similarly computed visual descriptors. Matching functions are defined 
for this purpose, for each visual descriptor. In the following, the matching pro- 
cedure is described for the color and shape descriptors defined in the previous 
section. 



4.1 Color Matching 

Matching of visual objects using color descriptors is based on mean color
vectors and dominant colors. More specifically, for mean color vectors, we use
the RGB information corresponding to the extracted moving objects of interest.
The color values of the region defined by the estimated object contour are
normalized to the interval [0,1], and the respective mean values $(r, g, b)$
are calculated. Thus, each extracted object is described by the mean color
vector $m = [r, g, b]^T$. The color matching criterion between two objects
with mean color vectors $m_i$ and $m_j$, respectively, is then

$$D_{MC} = \| m_i - m_j \| = \sqrt{(r_i - r_j)^2 + (g_i - g_j)^2 + (b_i - b_j)^2} \qquad (22)$$

This criterion is the Euclidean distance (the root of the summed squared
error) between the two color descriptors $m_i$ and $m_j$. $D_{MC}$ is used in
our implementation with adequate results, taking into account the mean color
vector of the objects in one or more frames: for more accurate results in case
of external lighting changes over time, we calculate the mean value of the
vector $m$ over successive frames, and then calculate $D_{MC}$ according to
that value.
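
A minimal sketch of the mean color criterion of eq. (22) follows. The function names are hypothetical, and the 8-bit input range used for the normalization to [0,1] is an assumption, since the text does not state the original value range.

```python
import math


def mean_color(pixels):
    """Mean (r, g, b) of a region; pixel values are assumed to lie in [0, 255]."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / (255.0 * n) for c in range(3))


def mean_color_distance(m_i, m_j):
    """Euclidean distance between two mean color vectors, as in eq. (22)."""
    return math.dist(m_i, m_j)
```

Averaging the output of `mean_color` over several frames before calling `mean_color_distance` would correspond to the multi-frame variant mentioned above.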

For the dominant color descriptors, the matching function used depends on
the components present in the query and target descriptors. The basic matching
function $D_{DC}$ between two objects $i$ and $j$ uses only the percentages
and color values and is defined as follows:

$$D_{DC} = \sum_{k=1}^{N_i} p_{ik}^2 + \sum_{l=1}^{N_j} p_{jl}^2 - \sum_{k=1}^{N_i} \sum_{l=1}^{N_j} 2\, a_{ik,jl}\, p_{ik}\, p_{jl} , \qquad (23)$$

where $p_{ik}$ and $p_{jl}$ are the percentages of the query and target
descriptors, and $a_{ik,jl}$ is the similarity coefficient between two colors
$c_{ik}$ and $c_{jl}$:

$$a_{ik,jl} = \begin{cases} 1 - d_{ik,jl} / d_{max} , & d_{ik,jl} \le T_d , \\ 0 , & d_{ik,jl} > T_d , \end{cases} \qquad (24)$$

where $d_{ik,jl} = \| c_{ik} - c_{jl} \|$ is the Euclidean distance between
two colors $c_{ik}$ and $c_{jl}$, $T_d$ is the maximum distance between two
colors considered as similar, and $d_{max} = \alpha T_d$, $\alpha > 1$. This
distance can be modified to take into account the optional variance. One can
then take a linear combination of the spatial coherency and the above distance
to give a combined distance as suggested in [10].
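
Eqs. (23) and (24) can be sketched as follows. Each descriptor is represented here as a list of (percentage, color) pairs with colors normalized to [0,1]; this representation, the function names, and the default values `t_d = 0.1` and `alpha = 1.2` are illustrative assumptions and are not taken from the text.

```python
import math


def similarity(c1, c2, t_d, alpha):
    """Similarity coefficient of eq. (24); zero for colors farther apart than t_d."""
    d = math.dist(c1, c2)
    return 1.0 - d / (alpha * t_d) if d <= t_d else 0.0


def dominant_color_distance(desc_i, desc_j, t_d=0.1, alpha=1.2):
    """D_DC of eq. (23); each descriptor is a list of (percentage, color) pairs."""
    self_terms = sum(p * p for p, _ in desc_i) + sum(p * p for p, _ in desc_j)
    cross_terms = sum(
        2.0 * similarity(c_i, c_j, t_d, alpha) * p_i * p_j
        for p_i, c_i in desc_i
        for p_j, c_j in desc_j
    )
    return self_terms - cross_terms
```

Under these assumptions, two identical descriptors whose dominant colors are mutually dissimilar give a distance of zero, while two single-color descriptors with dissimilar colors give the value two.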
4.2 Shape Matching

Once object contours have been normalized and are invariant to affine
transforms, the most common way to measure the similarity between curves is
the Euclidean distance. Another way to measure the similarity between curves
$s_i$, $s_j$ is the cross-correlation criterion, which is defined as

$$D_S = \rho(s_i, s_j) = \frac{\sum_{k=0}^{N-1} s_{ik} \cdot s_{jk}}{\sqrt{\sum_{k=0}^{N-1} s_{ik}^2 \; \sum_{k=0}^{N-1} s_{jk}^2}} \qquad (25)$$

where $s_{jk}$ is the k-th point of curve $s_j$. The cross-correlation is a
normalized measure, which denotes how similar two curves are, and indicates a
metric of their content similarity.
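
The cross-correlation of eq. (25) can be sketched for two curves sampled with the same number of points. Interpreting the product of two curve points as the dot product of their coordinate vectors is an assumption of this sketch, as is the function name.

```python
import math


def cross_correlation(curve_i, curve_j):
    """Normalized cross-correlation of eq. (25) for two equally sampled curves."""
    # Assumption: the product of two 2-D points is their dot product.
    numerator = sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(curve_i, curve_j))
    energy_i = sum(a[0] ** 2 + a[1] ** 2 for a in curve_i)
    energy_j = sum(b[0] ** 2 + b[1] ** 2 for b in curve_j)
    return numerator / math.sqrt(energy_i * energy_j)
```

The value is one for identical (or identically shaped, uniformly scaled) normalized curves and decreases as the point-wise agreement degrades.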



5 Experimental Results 

In this section we verify the efficiency of the proposed integrated scheme
shown in Fig. 1 on two video sequences representing three cases of object
recognition. In the first sequence a silver car is in motion, and it is
successfully extracted following the method presented in Section 2. In the
second sequence two vehicles are in motion and thus tracked: a car of the same
shape as the one extracted in the first sequence but of a different color
(green), and a truck (different shape) of the same color as the car of this
sequence. Thus, we are called upon to reach the following three conclusions:
(a) the proposed scheme performs very well even when the object contours are
extracted with variations from the ground truth (actual contours), or when
their shape is deformed due to the projection of their motion; to verify this
assumption we use the same object in different (non-successive) frames of the
same sequence; (b) the two cars of the two sequences are of the same type but
not of the same color; and (c) in the second sequence, the two moving objects
are different in terms of shape, and thus there is no need to proceed to color
matching to decide whether they are similar. Since the integrated scheme
proves efficient in the above three cases, the authors are currently working
on the construction of an object database and a low-complexity searching
procedure in that database.

Fig. 4 illustrates the performance of the tracking method described in Section
2, where the moving object of interest is the silver car. The object's contour
is extracted in four non-successive frames, and it is used for both the shape
and the color matching procedures. In this example, the efficiency of the
proposed matching algorithm is verified in two frames (Fig. 4(a), (d)) of the
sequence. The contour of the car, consisting of 100 sample points, is
illustrated in Fig. 5 for each frame. The algorithm's efficiency is based on
the affine transformations, following the proposed normalization steps
described in Section 3.2, as shown in Fig. 6. It can be seen that the final
curves match very well, although normalization of each curve is performed
without knowledge of the other. The cross-correlation between these two curves
is $\rho = 0.9995$ ($\approx 1$), which indicates that these two contours are
very similar.
Fig. 4. Tracking example in four frames of a sequence, panels (a)-(d).

Fig. 5. Affine invariant contours obtained for the same object in two different
instances (panels: Frame, Contour, Matching).

Fig. 6. (a) Curves after scaling normalization, (b) curves after rotation, and
(c) starting point normalization.



In the next example, illustrated in Fig. 7, two sequences containing objects of
different colors and with similar contours are presented. The respective
tracking results are shown in Figs. 3 and 4. The contour transformations
proposed in Section 3.2 result in similar contours, as shown in Fig. 7, which
indicates that the two cars are of the same type. This is also concluded
numerically from the cross-correlation between the contours of the silver and
the green car, which in this case is $\rho = 0.9988$, a value close to 1.
Fig. 7. Sequences which contain objects with the same shape but different color
(panels: Frame, Contour, Matching).

Fig. 8. Sequences with different objects of the same color (panels: Frame,
Contour, Matching).





In the final example, two objects with different shapes are examined, whose
dominant colors are similar, as shown in Fig. 8 (green car and green truck
extracted in Fig. 3). In such cases, depending on the application, we conclude
either that there is no need to proceed to color matching, since the two shapes
(and consequently the two objects) are quite different, or that their dominant
colors are similar (if we are interested in objects of the same color). The
contour normalization results, illustrated in Fig. 8, show that the two
contours are quite different: the cross-correlation between these two contours
is $\rho = 0.6586$.

Fig. 9 illustrates the color clustering results for the three objects examined:
(a) green car, (b) green truck and (c) silver car. For each object, four color
clusters are estimated along with the respective centers. It must be noted that
the colors shown in the 3-D graphs do not represent the true colors
corresponding to the clusters, but are used for representation purposes.
Fig. 9. Color distributions for the three objects, in the RGB space, after
clustering, panels (a)-(c): four color clusters for each object have been
estimated, whereas the center of each cluster is also illustrated.

Table 1. Color matching results (eqs. 22 and 23) for the three moving objects
of the examples illustrated in Figs. 3 and 4.

objects                      D_MC      D_DC
silver car - green car       0.3692    0.2564
silver car - green truck     0.3430    0.2572
green car - green truck      0.0520    0.0447

Finally, for the three extracted objects of the previously described examples,
the color matching results are given in Table 1. As can be seen in the last
row, two of the objects (the green car and the truck) are similar in terms of
color ($D_{MC} \approx 0.05$, $D_{DC} \approx 0.04$), whereas the matching
between the silver car and the other two objects leads to values of
$D_{MC} > 0.3$ and $D_{DC} > 0.2$.

6 Conclusions and Further Work 

In this paper an integrated scheme for moving object extraction and recognition
is proposed, aiming at the detection of objects of specific shape (contour) and
color. To this end, three different methods from the literature are revised,
extended and integrated: (a) moving object tracking, (b) affine-invariant
contour normalization, and (c) dominant color extraction. After following these
three steps, we decide on the similarity between two (or more) objects
according to appropriate criteria. In this work, we test the proposed
integrated scheme on three simple examples where the ground truth is available;
this is mainly done to verify our assumptions. We are currently working on
extending this scheme, using an appropriate database of real-world sequences,
for object-based video retrieval.
Abstract. This paper deals with the use of a model dedicated to human
motion analysis in video. The model has the particularity of being
able to adapt itself to the current resolution or the required level of
precision through possible decompositions into several hierarchical
levels. The first level of the model has been described in previous works:
it is region-based, and the matching process between the model and the
current picture is performed by comparing the extracted subject shape
with a graphical representation of the model consisting of a set of
ribbons. To perform this comparison, a chamfer matching algorithm
is applied to those regions. Until now, the correspondence problem was
treated independently for each element of the model in a search
area, one for each limb. No physical constraints were applied while
positioning the different ribbons, and no temporal information was taken
into account. We present in this paper how we intend to introduce all
those parameters in the definition of the different search areas, according
to positions obtained in the previous frames, distance to neighboring
ribbons, and quality of previous matching.



1 Introduction 

With the expansion of applications such as surveillance systems, user
interfaces or, in more cultural domains, the analysis of sport or dance
motions, the study of human motion in video sequences has become a rapidly
growing domain. The main objective is to be able to identify some predefined
gestures in the motion description provided by the analysis. This
identification step depends on the reliability of the results produced by the
analysis step, and on the ability to recognize in this description the same
gesture shot from different view angles. There are two major applications of
this work in the field of video content indexing: we intend first to be able
to identify segments in sport events where a specific gesture occurs; we also
want to apply this tool to video surveillance in order to automatically detect
a given human behavior and raise an alarm. Systems differ in the means involved
to achieve this analysis. They depend directly on the application itself, its
context, and some requirements imposed by user needs. The way the human body is
modeled is one of the most critical parts, and in our case we have to deal in
real time with a video flow of low quality. Therefore, the model we propose can
adapt itself to various video qualities and can be used to produce motion
descriptions at different levels of detail.

Different kinds of human models exist (for example H-Anim from MPEG-4 [5]),
but they are essentially designed for synthesis purposes and are therefore not
fully adequate for an automatic analysis tool.

In this paper, we propose a multi-level model for the human body and we
discuss how to impose physical constraints on it in order to reinforce the
results. The principle is to define only one model for any kind of document.
It can adapt itself to the current resolution or to the precision level
required by a user, thereby avoiding useless computation and preventing the
false detections that may happen when the model is defined with more accuracy
than can really be extracted from the video flow. To achieve this, the model is
composed of ribbons, each of them being decomposable into sub-ribbons, which
implies a descent into the hierarchical levels. Thus we can use only the level
corresponding to the document resolution or to specific needs. The matching
process between the model and frames extracted from a video consists of a
chamfer matching algorithm between model components and a distance map. This
map is obtained from a distance transform applied to the subject image, but
this step has to be preceded by a decomposition of the image into search areas
for each component of the model.

Defining a model in this hierarchical way has two advantages. First, real-time
processing becomes possible, since a result, even at a coarse degree, can
always be produced. The obtained precision depends on the time allowed for
computation: the more time is available, the more accurate the results are.
The second advantage is the possibility to adapt the model to application
needs. Users can specify the required degree of precision, and the appropriate
model decomposition is used accordingly, avoiding useless computation cost.

The definition of search areas is a critical point in the process. At the
beginning, they are estimated according to the most likely position of the
limbs in the image. Once a first match with the model has been performed, we
intend to refine the location of those search areas by taking into account
matching quality and by introducing physical constraints on the positioning of
the different elements in the image.

2 Related Work 

This section gives a brief overview of related work and, at the same time,
justifies the proposed approach. As mentioned before, human models are often
inspired by image synthesis. In this domain, the model developed by MPEG-4 [5]
is composed of "sticks" linked by "joints". But this kind of modeling presents
some limitations because it is based on the idea that articulations are centers
of rotation. It has not been designed for analysis purposes, which makes its
use difficult in this context, where limitations on possible subject postures
must be imposed. It seems difficult to specify constraints on, for example,
rotation angles or even limb velocity. Nevertheless, a use of this kind of
model can be found in [7], where a matching of the skeleton obtained from a
subject shape is performed. Another kind of modeling has been proposed in order
to derive the model directly from the original images themselves. This is the
case with statistical approaches. Among them, we can distinguish the use of the
spatial and color distribution of pixels in order to perform the image
segmentation [9]. But in that case, the motion description depends on
environment conditions and can often be too rough, and difficult to employ in
more general cases such as those encountered in multimedia information
retrieval. The second kind of statistical approach consists of models which are
"deformable", since they have the particularity of adapting to the subject
images ([3], [8]). By using point distribution models, the model shape becomes
"active". But this approach requires a training step for the matching process
to provide good results [4].

None of these modeling proposals integrates a multilevel resolution of the
description. Furthermore, most of those approaches require strict conditions on
the way the analyzed scene has to be shot (conditions on lighting, camera
centering, etc.). This reduces their potential application to generic and
real-time video content analysis. Even if our proposal does not yet take into
account 3D information, truncated shapes, camera motion, or multiple bodies, we
think that these points can be integrated in further work in a compliant way.
In any case, the reliability of the results in the generated description, which
allows trusting only pieces of information of good quality, is already taken
into account. This point is typical of an indexing approach, which is a new way
to address human motion analysis.

3 Previous Work 

3.1 The Hierarchical Model 

As mentioned before, the proposed model is defined in a hierarchical way: it is
composed of several levels, each of them corresponding to a level of detail
which best fits the video resolution. The principle is to perform a first
processing step using the first level of the model, which provides the roughest
results [6]. Then, these results can be refined by applying the second level,
which provides a sharper description of the human body. For elements of the
model where results are better with a more precise description, the next level
can be applied, and so on until the application of a new level provides results
which are no more significant than before. The final level is the one which
best fits the resolution of the video, that is, the level of detail which is
really extractable.

Possible applications fall into three groups. In the first case, no knowledge
about the studied document is supplied and no real-time processing is required.
A descent into the hierarchical levels is then performed and the best matching
is selected. The second case arises when a temporal constraint is introduced.
The accuracy of the description is then limited by computation time but, as
stated above, the proposed model is by design compatible with real-time
processing, so a result (even a coarse one in comparison with what is really
extractable from the video) can be produced. This property cannot be offered by
a non-evolutive model, where the coordinates of all components are required to
define a subject position. The third case occurs when the user wishes to
specify the accuracy level of the application. For example, in the same video,
a sharp description of the motion of the arms and a coarser one of the legs may
be required. The model can then be composed of high-level components for the
top of the body and low-level ones for the rest of the limbs. In this last
case, an adaptation to specific needs is achieved.

Considering those orientations, we are able to give a definition of our model,
which is graphically based on regions; this feature is essential for the
subsequent process, which relies on it. Region-based means that the surface
covered by one element of the model has to match the corresponding subject limb
in the image. The model is composed of a set of ribbons, each of them
associated with one part of the human body. The evolutive aspect of the
proposed model lies in the fact that those ribbons can be decomposed into
sub-ribbons, which allows the model to be adapted to the required level of
precision. Defining our model by simple elements such as ribbons and
sub-ribbons, instead of more developed features such as edges, has the
advantage of requiring only a simple description. Indeed, an element of the
model can be defined by two parameters, its length and width. The localization
in the image is also realized by two parameters: the coordinates of a control
point and an angle value which corresponds to the orientation of the concerned
segment. Thus, an element of the model is totally defined by a couple of
parameters to generate it and a couple of parameters to localize it in the
image. Obviously, increasing the resolution (implying a descent in the model
hierarchy) raises the number of elements and, in the same way, the number of
parameters required to define the posture of the studied subject.

Based on this idea of a multi-level model, we propose 3 different descriptions
(see Fig. 1). The first one is quite basic since it is composed of only 5
ribbons, one for each limb and one for the head-torso set. This representation
is rough, and most limb configurations cannot be described precisely. However,
a first description of the posture can be given by this kind of modeling and
may be sufficient for applications which do not require good precision. On the
second level, all 5 ribbons are split at the joint areas, which allows the
detection of a larger number of limb configurations such as, for example,
bending. Finally, a third level has been proposed. By using this one, it is
possible to distinguish the motion of hands and feet by splitting up the model
into components corresponding to the forearms and the lower part of the leg.
The choice of this decomposition into three different levels has been made
according to what seemed to be feasible from different image resolutions. Of
course, many other configurations based on this idea of a hierarchical model
are possible.
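
The ribbon description above (length and width to generate an element, a control point and an angle to localize it) suggests a very small data structure. The sketch below is purely illustrative: the class name `Ribbon`, the placement of the control point at one end of the ribbon, and the rule of splitting a ribbon into two halves of equal length are all assumptions, since the text does not specify where the joints fall.

```python
import math
from dataclasses import dataclass


@dataclass
class Ribbon:
    length: float  # generation parameter
    width: float   # generation parameter
    x: float       # control point (assumed to sit at one end of the ribbon)
    y: float
    angle: float   # orientation of the segment, in radians

    def split(self):
        """Decompose into two sub-ribbons (hypothetical rule: split at mid-length)."""
        half = self.length / 2.0
        mid_x = self.x + half * math.cos(self.angle)
        mid_y = self.y + half * math.sin(self.angle)
        return [
            Ribbon(half, self.width, self.x, self.y, self.angle),
            Ribbon(half, self.width, mid_x, mid_y, self.angle),
        ]
```

Descending one level in the hierarchy then amounts to replacing a ribbon by the result of `split()`, after which each sub-ribbon can be localized with its own control point and angle.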

3.2 Model/Subject Matching 

With our model defined, we now describe the methods used to match the different
elements which compose the model with the extracted video features. The subject
posture is indicated by the positions obtained for those elements. The first
step of our process is to match the first level of the hierarchical
decomposition which, in all cases, provides first information about the
localization of the limbs and can be employed during the next steps, when more
refined levels of the model are involved. In this way, using a coarse
definition of the model specifies the search areas and thus reduces the
computational cost of applying the following levels of the model.

Fig. 1. Graphical rendering of the proposed model. From left to right: the
three hierarchical levels.

Before the model/subject matching step, the first operation consists of
extracting from the video flow features which are compliant with the
region-based model. This is realized by a preprocessing stage made of a
background subtraction followed by morphological operations, in order to obtain
an image of the subject silhouette. Bounding boxes are then created; they give
a first reduction of the search field by keeping only regions of interest and,
through their length/width ratio, define the size of the elements composing the
model.

The problem is now to determine the correspondence between those silhouette
images and the different elements of the model. For this we use a chamfer
matching algorithm, whose principles are presented in ([1], [2]). The method
consists of constructing a distance map, computed with the chamfer distance, on
the binary image of the shape where the searched feature is located. Then a
distance between the image of the feature and this map is evaluated; it
indicates whether the proposed position of the model is acceptable or not and,
if not, gives an estimate of the difference. The distance transformation, i.e.
the conversion of a binary image into a distance map, is computed using the
3-4 Distance Transform defined by Borgefors [2], which allows a good
approximation of the chamfer distance in only two passes over the image.
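
A two-pass 3-4 distance transform can be sketched as follows, with weight 3 for horizontal and vertical neighbors and weight 4 for diagonal ones. The function names, the list-of-lists image representation and the convention that feature pixels carry distance zero are assumptions of this sketch; the helper `rms_under_mask` illustrates the root mean square used as the difference measure.

```python
import math


def chamfer_34(binary):
    """Two-pass 3-4 distance transform; binary[r][c] is truthy on feature pixels."""
    rows, cols = len(binary), len(binary[0])
    big = 10 ** 9
    dist = [[0 if binary[r][c] else big for c in range(cols)] for r in range(rows)]
    # Forward pass: propagate from the left and upper neighbors.
    for r in range(rows):
        for c in range(cols):
            for dr, dc, w in ((0, -1, 3), (-1, -1, 4), (-1, 0, 3), (-1, 1, 4)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dist[r][c] = min(dist[r][c], dist[rr][cc] + w)
    # Backward pass: propagate from the right and lower neighbors.
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            for dr, dc, w in ((0, 1, 3), (1, 1, 4), (1, 0, 3), (1, -1, 4)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dist[r][c] = min(dist[r][c], dist[rr][cc] + w)
    return dist


def rms_under_mask(dist, mask_pixels):
    """Root mean square of the distance values under a mask given as (row, col) pairs."""
    return math.sqrt(sum(dist[r][c] ** 2 for r, c in mask_pixels) / len(mask_pixels))
```

A model element placed entirely on feature pixels yields an r.m.s. of zero, and the value grows as the element moves away from the silhouette.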

The distance map being created, each element of the model has to be used 
as a mask over this map and the root mean square of the distance values located 
under the mask is computed. The r.m.s. is the difference measure presented by 
Borgefors as being the one providing the less false minima among other existing 
average measures. The search of all the components positions is performed in a 
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sequential way: for several successive values of the model parameters, only the 
ones corresponding to positions providing the minimum r.m.s. value are selected. 
Those parameters are spatial coordinates (a;; y) of the segments and the value of 
the angle with one of the two axes. Thus, the search for minimum r.m.s. has to 
take into account all possible orientations of segments in addition to their spatial 
coordinates. The implemented algorithm (see Fig. 2) begins with a predefined 
initial position and orientation. Then a matching step using only translations
is performed and, from the newly determined position, a rotation of ±π/4 around
the center of the intersection area between the component image and the subject
silhouette is applied. For each of these angle values, a new matching by
translations is performed. We select the position with the lowest r.m.s. value
among the initial one and the two new ones. To refine the result, this process
is repeated using a rotation angle equal to the preceding one divided by 2. This
time, the comparison is made between the previously selected position and the
two new ones resulting from this last step, and only the best one is kept. This
operation is repeated 6 times, ending with a rotation angle of ±π/128, which
gives a low error at the pixel scale. The choice of π/4 for the first rotation
follows from the symmetry of the model components, which ensures that all
possible orientations are explored even with the angle limited to this value.
The convergence of the algorithm relies on the fact that the matching error is
minimal when a model segment is globally oriented in the same direction as the
searched subject limb. Matching by translations only reduces the effect of
nearby pixels that do not belong to the area of interest. The general algorithm
is described in Fig. 2.
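
The r.m.s. measure and the coarse-to-fine rotation search can be sketched in a
few lines of Python. This is only a minimal sketch under stated assumptions, not
the implementation used in the paper: the distance map is a plain list of rows,
`score` is a hypothetical callable returning the r.m.s. error obtained after
translation matching at a given angle, and the default initial step and number
of halvings are illustrative choices.

```python
import math


def rms_under_mask(distance_map, mask_pixels):
    """Root mean square of the distance-map values covered by the mask."""
    values = [distance_map[r][c] for r, c in mask_pixels]
    return math.sqrt(sum(v * v for v in values) / len(values))


def refine_orientation(score, theta=0.0, step=math.pi / 4, iterations=6):
    """Coarse-to-fine angle search.

    At each iteration the current angle is compared with theta - step and
    theta + step, the best of the three is kept, and the step is halved.
    `score` is a hypothetical callable (assumption): lower is better.
    """
    best = score(theta)
    for _ in range(iterations):
        # Both candidates are built from the angle kept at the previous step.
        for candidate in (theta - step, theta + step):
            error = score(candidate)
            if error < best:
                theta, best = candidate, error
        step /= 2.0
    return theta, best
```

With a smooth, single-minimum `score`, the returned angle ends up within roughly
the last step size of the optimum, which is the behaviour the text relies on.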

The chamfer matching algorithm has the drawback of potentially producing
false detections when the initial position is too far from the optimal one. To
avoid this situation, we define a search area for each element of the model
(this step corresponds to the box labeled “Image cutting” in Fig. 2). As no
a priori information about the subject posture is available, the current image
is cut according to the most probable location of the subject limbs. Defining
these areas reduces the search field and therefore, in addition to avoiding
local minima, lowers the computational cost. On the other hand, if the searched
limb is not located in its assigned search area, it becomes impossible to
localize it. Thus, the way the areas are defined is a critical point of the
system.

3.3 Experimental Results with Non-evolutive Search Areas 

In this section, a few matching results using only the first level of the model
and non-evolutive search areas are presented. The search areas are called
“non-evolutive” because they are defined according to the most likely location
of the limbs in the image at the beginning of a sequence and do not take into
account any information from previous matching steps (see Fig. 3 for details of
the initial image cutting).

Even though the first level of the model offers limited possibilities, the
matching results are not far from the real subject posture. Some differences
appear when, for example, some pixels that do not belong to the limb concerned
are located in the search area, or when there are “holes” in the subject
silhouette that preprocessing is not able to fill. Obviously, the coarseness of
this first level imposes limitations on the limb configurations of the subject,
so some postures cannot be analyzed. In addition, many problems in analyzing
human motion come from the choice of a 2D model and of a single camera. Postures
in which depth is important are difficult to describe because the 3D information
is missing. Only data about motion from a global point of view could allow a
description of subject postures in such situations.

Fig. 2. The general algorithm from a frame to model coordinates. All
preprocessing steps and a distance transform can be computed in two passes over
the image. They are followed by an algorithm that performs matching with
translations and with rotations. The translation process is repeated as long as
the translation factor (cx;cy) is different from zero, and the rotation process
is, for its part, repeated as long as the angle value is greater than

Fig. 3. Subject silhouette obtained after preprocessing. The initial search
areas for the torso, right arm and right leg, respectively, are delimited by a
white line (within the image) and black ones (outside). The fields corresponding
to the left arm and leg are their symmetrical counterparts.

To evaluate the quality of the matching, a measure has been proposed. Its
purpose is to indicate the validity of the positioning of each element of the
model. The formula used is:

$$Q_1 = \frac{2 \times \text{Nbr of pix. in (area of real image} \cap \text{model component matched)}}{\text{Nbr of pix. of model component}} - 1$$

The value obtained for Q1 lies between -1 and +1; -1 denotes a non-significant
matching that cannot be used, and +1, on the contrary, a reliable one that is
useful for the study of the following pictures. It is important to note that
this measure only provides information on the surface covered by a model
element and does not take into account whether the detected limb is the
searched one or not. This knowledge can only be obtained by specifying
constraints on the distance between the different model components and
evaluating the validity of the resulting posture.
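
As a rough illustration, the coverage measure can be computed from two sets of
pixel coordinates. This is a sketch only: the set-based representation and the
function name are assumptions, and the offset of -1 is inferred from the stated
range of -1 to +1.

```python
def q1(silhouette_pixels, model_pixels):
    """Coverage-based matching quality in [-1, +1].

    Both arguments are sets of (row, col) tuples (an assumed representation).
    The fraction of model pixels lying on the silhouette is rescaled from
    [0, 1] to [-1, +1].
    """
    covered = len(model_pixels & silhouette_pixels)
    return 2.0 * covered / len(model_pixels) - 1.0
```

A model element lying entirely on the silhouette gives +1, one lying entirely
outside it gives -1, and half coverage gives 0.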

In the example silhouette given in Fig. 1, we can notice that, because of the
thresholding performed in the preprocessing step, some parts of the subject have
not been well highlighted and some “holes” appear within the silhouette. These
pixels are not taken into account when computing the quality of the matching,
although they certainly belong to the subject limb. To solve this problem and
provide a better evaluation of the matching, we propose to compute a coefficient
Q2 from the distance map:

$$Q_2 = f\left(\frac{\sum \text{Values of distance map} \cap \text{model component matched}}{255 \times \text{Nbr of pix. of model component}}\right)$$

where f is a function intended to spread the values out between -1 and +1,
thereby making the results easier to interpret. This function is determined
experimentally so as to yield values that agree with the real subject posture.

Like Q1, Q2 takes its values between -1 and +1. Normalization is performed by
taking 255 as the maximum value allowed for an element of the map. This limit
has been determined from the size of the bounding boxes, and the value is seldom
reached. In what follows, we use the letter Q to denote the coefficient of
matching quality, which is a function of the coefficients Q1 and Q2.
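
The distance-map coefficient can be sketched in the same spirit. The spreading
function f is determined experimentally in the paper and is not given, so the
linear default below is purely a hypothetical placeholder; the assumption that
small distances should map to values near +1 is also ours.

```python
def q2(distance_map, model_pixels, spread=None, max_value=255):
    """Distance-map-based matching quality.

    distance_map: list of rows of distance values (assumed representation).
    model_pixels: iterable of (row, col) covered by the matched model element.
    spread: stand-in for the experimentally determined function f; the
            default 1 - 2m is a hypothetical choice, not the paper's f.
    """
    if spread is None:
        spread = lambda m: 1.0 - 2.0 * m
    pixels = list(model_pixels)
    # Distance values are capped at max_value, as the map is in the text.
    total = sum(min(distance_map[r][c], max_value) for r, c in pixels)
    normalised = total / (max_value * len(pixels))
    return spread(normalised)
```

Because holes in the silhouette still lie close to silhouette pixels, their
distance values stay small and they penalise this coefficient far less than
they penalise the coverage-based one.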



Table 1. Q1 and Q2 values obtained for the images given in Fig. 4. Each column
corresponds to an element of the model, for each of the two coefficients.

As we can see from the graphical rendering of the evaluated postures (see
Fig. 4), the proposed algorithm produces quite good results when the objective
is to match a model element with pixels located in a precise area. However, this
operation detects a given limb only if the search areas have been correctly
defined. The way these areas are defined is therefore decisive, since a limb
cannot be detected if it does not lie in its search field. Obviously, this
implies that, without a priori knowledge about the performed motion, an
initialization step during which the subject limbs have to cross their
respective search areas is necessary before a tracking process can begin. To
achieve this tracking step, an evolution of the areas according to the previous
matching, its quality, and physical constraints must be proposed.

4 Search Areas Redefinition 

At this stage of development, the goal is to incorporate into the system
knowledge derived from parameters likely to provide relevant information. So
far, each element of the model has been matched independently of the others. For
example, no information about the torso position was used to guide the matching
of the arms or of any other limb. This highlights the need to model the physical
constraints inherent to the human body. The evaluation of the matching quality
also has to be taken into account by the algorithm. Indeed, the surface to be
covered by a search need not be the same when the previous matching can be
considered excellent as when it was not good at all. Of course, these two
parameters (physical constraints and matching quality) are not totally
unrelated, and their use must be a good compromise between their possible
contributions. One way to incorporate them into the matching process would have
been to modify the distance map directly, giving new values to some map elements
in order to orient the search. However, the resulting computational complexity
makes this an inappropriate solution. Consequently, we propose to incorporate
the constraints (both of quality and of physical type) into the definition of
the search areas, which is a principal part of the system and the one on which
most of its potential efficiency relies.

Fig. 4. Frames extracted from a video. The results of the matching process are
represented by white lines around the subject.



4.1 Application of Physical Constraints 

We intend to use the fact that, for example, the element corresponding to the
torso cannot be farther than a given distance from the ones representing the
arms and legs. In this way, a model of the tensions involved in the human body
can be built. It is important to note that we do not want to force an element to
be located precisely at the side of the torso (for example), but rather to lead
the next search in a more probable direction in order to save computation time
and avoid wrong matchings. We propose to bring together model elements that are
supposed to be joined by redefining the search areas according to the previously
determined limb locations.

The goal is, once a search area has been determined, to move this field
towards the adjacent limb. This operation can be carried out in several ways:
two adjacent segments can be brought closer to each other, or one segment can be
considered static while only the other moves closer. The quality coefficient Q
can help decide which area should be considered static, but only to a very
limited extent because, as mentioned before, it only provides information about
the covered surface and not about which limb was actually detected. As a first
step, in order to evaluate these principles on a simple case, we suppose that
the segment representing the torso is the one most likely to be effectively
matched with the searched limb. This assumption is made because the torso is the
only limb we are almost sure to have detected without too much ambiguity when a
subject is present, even if its localization is not precise at first and needs
to be refined. Thus, this model element can be chosen as a reference, and we
have to move the search areas of all the adjacent components towards it.

Figure 5 illustrates these principles in a simple case. It shows the result of
a first matching. As stated above, we suppose that the evaluated position of the
head-torso set can be considered quite good (even if it will also have to evolve
in order to become more accurate). The search area of the arm, for example, has
to be redefined, first in terms of surface (this problem is addressed in the
next section), and then in terms of position. For the latter, the information
about the torso matching in the previous frame is used to shift the area
coordinates towards the corresponding part of the torso. A translation moves the
search area to an intermediate position between the original one and the torso
extremity. We use an interpolation rather than a complete translation because
the torso localization is not accurate enough (as for the other limbs) and may
evolve over time. The interpolation should smooth the segment displacement and,
at the same time, avoid possible oscillations. In this way, the search field
evolves towards an area of the image in which the searched limb is supposed to
be located. This redefinition is based on part of the result obtained from the
previous matching. Thus, we intend to provide a valid redirection for the
definition of the search fields. To determine by how much an area should be
moved, we propose to use the matching quality given by the coefficient Q. The
proposed formula for the translation factor T(x; y) is:

T(x; y) = q 7 (R_a(x; y) - P_t(x; y))

with:

— Q_a, Q_t: coefficients of matching quality for the limb concerned (here the
arm) and the torso

— R_a: initial position of the limb search area

— P_t: position of the torso segment.
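
The idea of moving a search area only part of the way towards the torso can be
sketched as below. This is an illustration under loud assumptions: the weighting
function is hypothetical and is not the quality-dependent factor of the formula
above, the cap of one half on the displacement is arbitrary, and the sign
convention (the translation is added to the area position) is also an
assumption.

```python
def interpolation_weight(q_limb, q_torso, max_fraction=0.5):
    """Hypothetical weight in [0, max_fraction] (not the paper's factor).

    The area moves more when the torso match is reliable and the limb
    match is poor. Both coefficients are expected in [-1, +1].
    """
    limb_conf = (q_limb + 1.0) / 2.0
    torso_conf = (q_torso + 1.0) / 2.0
    return max_fraction * (1.0 - limb_conf) * torso_conf


def shift_search_area(area_pos, torso_pos, q_limb, q_torso):
    """Return the new area position and the translation applied to it."""
    w = interpolation_weight(q_limb, q_torso)
    tx = w * (torso_pos[0] - area_pos[0])
    ty = w * (torso_pos[1] - area_pos[1])
    return (area_pos[0] + tx, area_pos[1] + ty), (tx, ty)
```

Because the weight never reaches 1, the area is never moved all the way onto the
torso, which gives the smoothing effect that the interpolation is meant to
provide.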

Although the torso is considered as a reference for the other matching
operations, its search area nevertheless has to be redefined in order to provide
an accurate positioning. We intend to do this by a weighted interpolation from
the positions of the four other search areas. Only a translation of the search
field is performed; the orientation obtained for the element in the previous
frame is kept as the orientation of the redefined area. This angle represents
the general orientation of the pixels within the area and provides information
that should be used in the area redefinition process. Note that restricting the
distance between the joints of adjacent areas and redefining the search fields
as explained above limits, in most cases, the surface that may be shared by
different areas, except of course when a limb passes behind another one, in
which case overlaps are inevitable.

4.2 Application of the Quality Matching Coefficient 

In the previous section, we discussed the localization of a search field around
the determined area of a model element; however, the surface covered by this
area still has to be defined. To this end, we propose to use the coefficient Q,
which represents the quality of a matching, to derive well-proportioned
dimensions for the search areas. Indeed, the fewer pixels are present in the
current search field (that is, the lower the value of Q), the more this field
has to grow in the next frame in order to increase the chances of finding the
limb. Conversely, if Q is high enough, many pixels belong to the search area and
the current matching can be considered quite reliable. In addition, if the
positions obtained for the different elements of the model conform to a human
body configuration (that is, comply with the physical constraints), we may
suppose that the real limb has been detected. Then, to process the next frame,
it is not necessary to search a wide area; we can restrict the search to the
near neighborhood.

Fig. 5. Application of physical constraints to the definition of search areas.
From an obtained posture (on the left), search areas are first defined and then
relocated using control points. The new areas after this operation are
illustrated on the right.
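
The inverse relation between Q and the size of the search field discussed in
Section 4.2 can be sketched as a simple monotone mapping. The linear form and
the bounds below are hypothetical; the text only requires that the area grow
when Q decreases and shrink when Q increases.

```python
def area_scale(q, min_scale=1.0, max_scale=3.0):
    """Scale factor for the next search area (hypothetical linear mapping).

    q = +1 (reliable match) gives min_scale; q = -1 (poor match) gives
    max_scale. Values of q outside [-1, +1] are clamped.
    """
    q = max(-1.0, min(1.0, q))
    t = (1.0 - q) / 2.0  # 0 for a reliable match, 1 for a poor one
    return min_scale + t * (max_scale - min_scale)
```

Any other decreasing function of Q would serve equally well for the
illustration.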



4.3 Proposed Parameters 

Two aspects have to be taken into account when redefining a search area: the
physical constraints arising from the constitution of the human body, and the
extent to which a previous matching must influence the next one. We have to
propose a means of introducing these constraints into the parametric definition
of a search area. Such an area can be defined from a single point and an angular
value (see Fig. 6). We use an angular sector because this kind of parameter
seems better suited to the motions of the model elements and allows more
accurate descriptions. The position of the point O and the value α depend
directly on the physical constraints and on the quality of the previous
matching. We propose to place O on the line passing through the midpoint of the
width of a model element (at the end of the element nearest to the torso),
keeping the segment orientation computed in the previous step. Another
possibility would have been to fix O on the joint with the torso (or near this
point). In that case, the only parameter defining the surface of the area would
be α. This solution certainly simplifies the computation of the area by reducing
the number of parameters, but in most cases the resulting surface would be
larger in terms of covered pixels and would therefore require more computation
time. Furthermore, it could not provide as sharp a definition as the proposed
solution. Finally, not exploiting the previous limb orientation represents a
definite loss of relevant information.

The distance p between the points M and O (see Fig. 6) is a function of the
quality of the previous matching: the weaker Q, the higher p, so as to have as
many possibilities as possible to describe the search area precisely in each
case. From O, the angular value α partially determines the surface covered by
the search area. This parameter also has to depend on Q, increasing as Q
decreases: if the matching was good, we do not have to define a wide area;
conversely, if the matching was not good enough, we have to extend this surface
to recover pixels belonging to the limb concerned. To close the search area, we
still have to provide a value for the height h of the area, which is another
parameter used to define the search area. So far, only the quality of the
matching has been used to describe the search field in terms of surface. The
physical constraints of the human body are applied when localizing this search
area in the image space, as seen in the previous section. By reducing the
distance between the area and the obtained position of the model element
corresponding to the torso, we intend to focus the search in this direction and
thus on the most likely part of the image. Of course, this implies that the
torso position is seen as a reference for the other limbs, but it also has to
evolve itself, by moving its own search area, defined as a rectangular surface
around it with the same orientation as in the previous frame. We do not use an
angular sector to define this search field. Indeed, the motion of the torso does
not necessarily require a very accurate description of the search area, because
of the size of this limb within the bounding box. The temporal evolution of the
torso orientation is generally smaller than that of the other limbs (an arm,
for example), and thus does not call for expensive computation. On the other
hand, the dimensions of this rectangular area have to evolve with the matching
quality. The localization of this search field has to be carried out as
described in the previous section.

In the next levels of the hierarchical model, which provide a more accurate
description of the adopted postures, the different search areas will be
redefined using the parameters described previously: a single point and an
angular value. This time, we propose to take as the reference limb the one at
the immediately higher level in the hierarchy. For example, the segment
representing the forearm will have an area that depends on the position of the
other elements composing the full arm. Another kind of hierarchy (this time
between the limbs) is thus involved. However, this dependency in the
redefinition of the areas should not prevent the matching process from running
when a limb of a higher level has not been correctly detected.

We mentioned that, as a first step, the orientation of the search area will be
the same as that of the limb concerned in the previous frame. This assumption
comes from the fact that, between two consecutive frames, the motion cannot be
fast enough to produce a significant change in orientation. However, this method
of determining the orientation cannot be applied when only key frames are
processed, because large variations may occur. In that case, the motion dynamics
must be evaluated in order to estimate the new orientation of the search area.

In order to define the search field, four points (R1, R2, R3, R4) delimiting
this area are required. The coordinates of these points in the image plane are
computed directly from the parameters p, α and h described above. They are
obtained by performing two changes of reference mark: first, O is defined by its
polar coordinates in the reference mark centered on the point P. Then, the
coordinates of R1, R2, R3 and R4 in the polar reference mark of center O are
computed. Finally, we are able to give the coordinates of the four points in the
Cartesian image reference. The resulting formulas are given below:

$$R_1 : (R_{1x}; R_{1y}) = (X_O + \rho_1 \sin\theta_1 + P_x + T_x ;\; Y_O + \rho_1 \cos\theta_1 + P_y + T_y)$$

$$R_2 : (R_{2x}; R_{2y}) = (X_O + \rho_2 \sin\theta_1 + P_x + T_x ;\; Y_O + \rho_2 \cos\theta_1 + P_y + T_y)$$

$$R_3 : (R_{3x}; R_{3y}) = (X_O + \rho_2 \sin\theta_2 + P_x + T_x ;\; Y_O + \rho_2 \cos\theta_2 + P_y + T_y)$$

$$R_4 : (R_{4x}; R_{4y}) = (X_O + \rho_1 \sin\theta_2 + P_x + T_x ;\; Y_O + \rho_1 \cos\theta_2 + P_y + T_y)$$

with: 

— pf 0 ;F 0 ) = (-^ sin{(3 H - ^ H - (5) 5 sinS cos(/3 “h ^ H - (5)) Cartesian coordinates 
of O in reference mark of center P 

— ρ1 = is the modulus of R1 and R4 in the reference mark of center O

— θ1 = β − α/2 is the argument of R1 and R2 in the same reference mark

— ρ2 = corresponds to the modulus of R2 and R3

— θ2 = β + α/2 corresponds to the argument of R3 and R4.

In addition, we specify that:

— L is the length of the model segment concerned

— δ is the angle between (OP) and the element (see figure) and is equal to

atari (■£■) 

2 

— β is the orientation obtained by the matching process

— (Px; Py) are the coordinates of point P in the image

— (Tx; Ty) is the translation factor computed according to the concepts
presented in the previous section

— l is the element width.

The coordinates of each extremity of a search area can thus be computed from
three parameters, p, α and h. These parameters are functions of the matching
quality coefficient Q obtained at a previous processing step.
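
The four corner formulas share one pattern, which the following sketch makes
explicit. It is a sketch under stated assumptions rather than the authors' code:
the two moduli and the coordinates of O relative to P are passed in as inputs
instead of being derived from p, α and h, and the two arguments are taken to be
the limb orientation β minus and plus half of the sector angle α.

```python
import math


def search_area_corners(origin, rho1, rho2, beta, alpha, p_point, translation):
    """Corners R1..R4 of an angular-sector search area in image coordinates.

    origin:      (X_O, Y_O), coordinates of O relative to P (assumed given)
    rho1, rho2:  moduli of the near and far corners relative to O
    beta, alpha: limb orientation and sector angle, in radians
    p_point:     (P_x, P_y), position of P in the image
    translation: (T_x, T_y), translation factor of the search area
    """
    x_o, y_o = origin
    p_x, p_y = p_point
    t_x, t_y = translation
    theta1 = beta - alpha / 2.0  # assumed argument of R1 and R2
    theta2 = beta + alpha / 2.0  # assumed argument of R3 and R4

    def corner(rho, theta):
        # Polar coordinates around O, then shifted into the image frame.
        return (x_o + rho * math.sin(theta) + p_x + t_x,
                y_o + rho * math.cos(theta) + p_y + t_y)

    return [corner(rho1, theta1), corner(rho2, theta1),
            corner(rho2, theta2), corner(rho1, theta2)]
```

Note that, as in the formulas, the sine gives the x offset and the cosine the y
offset, so angles are measured from the y axis.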

4.4 Future Work

Future work will mainly determine in what proportion each parameter (p, α and
h) has to be used in order to obtain, in all the different cases, the best size
for the new search areas. As mentioned before, the other point to be worked on
is establishing the relation with the previous matching. Another aspect is the
temporal evolution of the parameter values. Oscillations are likely to occur,
and limits on the possible results should be applied in order to avoid them,
just as an interpolation has been introduced in the translation of the search
areas with the same objective. Introducing other conditions into the definition
of the search fields may also be considered. For example, an evolution that
takes into account the redefinitions of the other areas, ordered from the most
likely ones, thereby creating a kind of evolving hierarchy between the different
limbs, is a possible direction to explore. Obviously, this will require deciding
how to order the segment positions, as we have done by choosing the torso as the
reference for the first matchings.

Fig. 6. Redefinition of the search field corresponding to the arm. The
coordinates of each extremity of the area are computed from the position of the
model element. On the right, an enlargement highlights the angles used in the
computations.



5 Conclusion 

In this paper, we have proposed methods for incorporating temporal and spatial
information into the refinement of limb localization in a system performing
subject/model matching. This system had already given first matching results at
the coarsest level of a hierarchical model. Our goal was to propose means of
positioning the different elements more accurately. To achieve this, we first
introduced a direct application of the physical constraints inherent to the
human body by modifying the location of the next search fields, which are
essential to improving the quality of the results. The torso is considered as
the reference limb for the other elements during this operation. The second step
was to define these search areas in terms of covered surface by taking into
account the quality of the previous matching. This was done by using an angular
sector that generates a surface according to the area to be explored. With these
two processes, new positions and sizes of the search fields can be defined. The
next step of our work will be to provide an experimental validation of all the
principles presented and to determine in what proportion each parameter involved
in the computation of the new search areas has to contribute to the area
definition.



Abstract. We outline a method to learn fuzzy rules for visual speech
recognition. Such a system could be used in automatic annotation of video
sequences, to aid subsequent retrieval; it could also be used to improve the
recognition of voice commands when a system has no keyboard. In the implemented
system, features were extracted automatically from short video sequences, by
identifying regions of the face and tracking the movement of various points
around the mouth from frame to frame. The words in video sequences were
segmented manually on phoneme boundaries and a rule base was constructed using
two-dimensional fuzzy sets on feature and time parameters. The method was
applied to the Tulips 1 database and results were slightly better than those
obtained with techniques based on neural networks and Hidden Markov Models. This
suggests that the learned rules are speaker independent. A medium sized
vocabulary of around 300 words, representative of phonemes in the English
language, was created and used for training and testing. Reasonable accuracy for
phoneme classification was achieved. Because of the ambiguity and similarity of
various speech sounds a scheme was developed to select a group of words when a
test word was presented to the system. The accuracy achieved was 21-33%,
comparable to expert human lip-readers whose accuracy on nonsense words is about
30%.



1 Introduction 

There are many possible applications of visual speech recognition (automatic
computer lip-reading) [1]. Great progress has been made in the field of acoustic
speech recognition, but the quality of systems degrades considerably in the
presence of noise. Environmental noise is a major obstacle in the commercial use
of speech recognition techniques (see www.research.ibm.com/AVSTG/srec.html).
There has been recent interest in developing a lip-reading mobile phone, to
avoid noise problems in trains, open plan offices, etc.
(www.newscientist.com/news/news.jsp?id=ns99992122).

In the field of adaptive multimedia retrieval systems, a popular approach is to use 
text annotations (e.g. subtitles) and apply text-based information retrieval techniques 
(see [2] for example). Visual speech recognition can help this approach in a number 
of ways such as automatic generation of annotations (either where no manually- 
created subtitles exist, or to augment manually-created subtitles which may abbreviate 
or paraphrase the actual words spoken). Additionally, visual speech recognition can 
isolate the exact frames where a word or sentence is spoken, aiding accurate retrieval. 



A. Nürnberger and M. Detyniecki (Eds.): AMR 2003, LNCS 3094, pp. 164-175, 2004.

Fig. 1. Features used in the speechreading system



The information regarding speech contained in visual signals is supplementary and 
complementary to the information contained in audio signals, especially in the pres- 
ence of noise [3]. Sounds which are difficult to distinguish in audio signals are easy to 
discriminate in visual signals and vice versa. For example the phonemes b and k are 
difficult to distinguish when only their audio signals are present but easy to discrimi- 
nate on the basis of their lip movements. The opposite is true for the sounds of the 
phonemes p/b/m which have similar lip movements but different acoustic spectra. 
Some researchers [4-7] have developed speech recognition programs that incorporate 
computer lip-reading, demonstrating considerable improvement over systems employ- 
ing acoustic signals only. Systems for automatic computer lip-reading can be used to 
build aids for people with hearing difficulties and in educational packages for the 
teaching and training of hard of hearing children. Visual speech recognition can also 
be applied in video conferencing, by compressing data using the information of the 
speaker’s lip movements. Study of lip-reading is used in simulating speech and talking 
faces in computer graphics. In the context of adaptive retrieval, a speech reading sys- 
tem could be used for automatic annotation of video sequences. 

It should be emphasised that many muscles or organs used in speech (e.g. the vocal 
cords and velum) are inside the mouth and are not visible to the eye. Lip movements 
play a relatively minor part in the production of sound [8]. A speechreader, whether 
human or machine has only the lips, jaws and the occasionally visible tongue as guid- 
ance for inferring speech. 

Another difficulty encountered by human and machine lip-readers is that many
sounds (e.g. t, d, n, l etc.) do not require a prominent movement of the lips or
the jaws. It is estimated that under usual viewing conditions approximately 60
percent of the speech sounds are either obscure or invisible [8]. There is
confusion over half of the vowels and diphthongs and three fifths of the
consonants. Another important factor that renders speech recognition a very
challenging task is that a particular lip movement is generally common to many
phonemes, hence making them visually indistinguishable. Examples of such
phonemes are /p, b, m/, in which the lips come together, or /f, v/, with the
lower lip to upper teeth movement. Such sounds are grouped together into
visemes, which are the representative units of visual speech and are roughly
equivalent to phonemes in acoustic signals of speech. Speech experts put all the
consonants of the English language in a group of 4 to 12 speechreading
movements.

In this paper, we explore the use of fuzzy set theory and learning techniques in vis- 
ual speech recognition. The recognition is based purely on visual information of the 
lip images of a speaker and no audio information has been taken into account. Classi- 
fication algorithms based on normalising time using the cross product space approach 
have been developed and segmentation algorithms using fuzzy rules have been formu- 
lated (although the segmentation work is not reported in detail here). Efficient storage 
of the vocabulary of words, incorporating context information within its structure has 
also been investigated. 



2 The Data 

The raw data used for automatic lip-reading was provided in the form of a video of 
the lip movement of a person uttering some elements of speech. The video was con- 
verted to a sequence of frames and features were automatically extracted from each 
frame of the video sequence. The technique employed for automatic facial feature 
extraction from images is described in [9]. Additional searching algorithms were 
added on top of these classification routines to find the corners of the lips and mouth 
and measurements were obtained for the following features (see Fig. 1): 

1. Width of lips 

2. Height of lips 

3. Width of mouth 

4. Height of mouth 

5. Height of upper lip 

6. Height of lower lip 

The following additional features, which are not always visible, were also taken into 
account : 

• Height of tongue below upper lip 

• Height of tongue above lower lip 

• Height of tongue between upper & lower teeth 

• Height of upper teeth 

• Height of lower teeth 

Since the objective of this paper is to illustrate the learning and speech recognition 
capabilities of the application, details of feature extraction are not presented here. For 
each sound or word a plot of features against time can be obtained. For example, plots 
of all the features for the words ‘feet’ and ‘peak’ are shown in Fig 2. 

After feature extraction the words in the vocabulary were segmented manually on 
phoneme boundaries and similar phonemes were grouped into viseme classes [10]. 

2.1 Fuzzy Representation of Feature Evolution 

Speech data for a word or a phoneme is acquired as a plot of feature values against 
time. The learning task is to provide a phoneme classification from the feature / time 
graphs. This data is composed of sequences of different lengths. A phoneme compris- 
ing 15 frames in one word might be composed of 20 frames in another word. Also, 
the duration of the utterance of consonants is much longer than that of a vowel. To 
enable all sounds to be defined in a uniform way, the time intervals for phonemes are 
normalised to the same length. 





Our representation of speech data in time is based on the use of compound words 
in Cartesian space and extended Fril rules derived from them. A word or a linguistic 
term can be represented by a fuzzy set of points representing a clump of elements 
drawn together by similarity [11]. We define fuzzy partitions over the domains of 
feature values and time, as shown in Fig. 3. Any point on the axis has a nonzero 
membership in at most two fuzzy sets. Each triangular fuzzy set is referred to as a 
word (granule or label). For illustration purposes we choose three fuzzy sets for 
each feature here; in practice, a finer partition may be needed. 

Fig. 3. Fuzzy partitions over feature value and time 

We take the cross product of the fuzzy partitions, to define 2-dimensional fuzzy sets 
over the feature and time axes. These cross product fuzzy sets are Cartesian granules. 
In Fig 4(a) each cell models a compound word (Cartesian granule), so for example the 
shaded cell represents “feature value is medium and time is middle”. 




Fig. 4. (a) grid defined by cross product fuzzy sets (left) and (b) counting procedure for filling 
the grid 



A simple counting procedure based on the theory of mass assignments [12-14] has 
been adopted for constructing the grid from example points in the data set. The data 
point (x, t) shown in Fig 4(b) can be written in terms of the two-dimensional fuzzy 
sets (small, start), (small, middle ), (medium, start), (medium, middle), with a propor- 
tion of the point falling in each cell. Using mass assignment theory, the proportions in 
this case are 



$$\mu_{small}(x)\,\mu_{start}(t),\quad \mu_{small}(x)\,\mu_{middle}(t),\quad \mu_{medium}(x)\,\mu_{start}(t),\quad \mu_{medium}(x)\,\mu_{middle}(t)$$









respectively, where $\mu_i(y)$ is the membership of the point $y$ in fuzzy set $i$. For a given 
data tuple, the membership of the feature value and time in their corresponding fuzzy 
sets is determined and the relevant cell is incremented by the product of memberships 
in the two fuzzy sets. 
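
The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the triangular sets are taken to be evenly spaced, time is rescaled to [0, 1] per sequence, and the function names (`triangular_memberships`, `granule_memberships`, `fill_grid`) and the toy feature values are hypothetical, not taken from the original system.

```python
def triangular_memberships(value, lo, hi, n_sets):
    """Membership of `value` in n_sets evenly spaced triangular fuzzy sets on [lo, hi].

    Neighbouring triangles overlap so that the memberships sum to 1 and at
    most two of them are nonzero (n_sets must be at least 2).
    """
    value = min(max(value, lo), hi)            # clamp to the domain
    step = (hi - lo) / (n_sets - 1)            # distance between neighbouring peaks
    peaks = [lo + i * step for i in range(n_sets)]
    return [max(0.0, 1.0 - abs(value - p) / step) for p in peaks]


def granule_memberships(x, t, x_range, n_feature_sets, n_time_sets):
    """Membership of the point (x, t) in each Cartesian granule; t is already in [0, 1]."""
    mu_x = triangular_memberships(x, x_range[0], x_range[1], n_feature_sets)
    mu_t = triangular_memberships(t, 0.0, 1.0, n_time_sets)
    return [[mx * mt for mt in mu_t] for mx in mu_x]


def fill_grid(sequence, x_range, n_feature_sets, n_time_sets):
    """Accumulate membership products for one feature sequence (one value per frame)."""
    grid = [[0.0] * n_time_sets for _ in range(n_feature_sets)]
    last = max(len(sequence) - 1, 1)           # rescale frame index to [0, 1]
    for frame, x in enumerate(sequence):
        cell = granule_memberships(x, frame / last, x_range, n_feature_sets, n_time_sets)
        for i in range(n_feature_sets):
            for j in range(n_time_sets):
                grid[i][j] += cell[i][j]
    return grid


# One point with three sets per axis (small/medium/large, start/middle/end).
point = granule_memberships(2.0, 0.2, (0.0, 10.0), 3, 3)
# A toy five-frame feature sequence.
counts = fill_grid([1.0, 4.0, 9.0, 4.0, 1.0], (0.0, 10.0), 3, 3)
```

Each data point distributes a total mass of 1 over at most four cells, so the grid total equals the number of points in the sequence.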

After considering each point in the feature/time graph, the count in the grid is divided 
by the total number of entries to give a feature probability distribution $\theta_{ik}$ over 
the cells $i$ for a given class $k$. The conditional probability of class $k$ given cell $i$ is 
given by 



$$\Pr(class_k \mid cell_i) = \frac{\Pr(cell_i \mid class_k)\,\Pr(class_k)}{\Pr(cell_i)} = \frac{\theta_{ik}\,\Pr(class_k)}{\sum_j \theta_{ij}\,\Pr(class_j)}$$

Assuming that all classes are equally likely the above form reduces to: 

$$\Pr(class_k \mid cell_i) = \frac{\theta_{ik}}{\sum_j \theta_{ij}}$$



When an example point from the test set is encountered, the probability $e_i$ of each 
cell $i$ is determined for the features in the test data point. The support for class $j$ from a 
single grid, which represents a single feature $k$, is given by: 

$$support_{jk} = \sum_i e_i \Pr(class_j \mid cell_i)$$

The overall support for the example belonging to class $j$ using $m$ grids representing 
$m$ different features is averaged as: 

$$support_j = \frac{1}{m} \sum_{k=1}^{m} support_{jk}$$

The number of mutually exclusive fuzzy sets placed on the time axis remains the 
same for every sound or word and hence they are made broader or narrower depend- 
ing upon the number of sequences comprising the word. At each time step the mem- 
bership of a feature in its corresponding feature fuzzy sets and time fuzzy sets is de- 
termined and a grid on the cross product space of words representing features and 
time is obtained. Now this grid can either be converted to a fuzzy set using the theory 
of mass assignments or it can be normalised with respect to various classes of sounds 
in an appropriate manner to derive extended Fril rules using Bayes' theorem. 

The framework for modelling a single sequence of sounds can be used in the gen- 
eration of a knowledge base of rules. Each viseme or a whole word can represent a 
class. Each class of sound is therefore represented by a number of features. When a 
test sequence of speech sounds is encountered, its support for each class is determined 
from the rule base. The class having the highest support is the classified sound. 
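
The classification step of this section (normalised grids, class posteriors per cell under equal priors, per-feature support and the average over features) can be sketched as follows. This is a minimal illustration and not the extended Fril rule machinery itself; the function names and the toy three-cell grids are assumptions introduced for the example.

```python
def normalise(counts):
    """Flatten a grid of counts into a probability distribution over cells."""
    flat = [c for row in counts for c in row]
    total = sum(flat)
    return [c / total for c in flat]


def cell_posteriors(theta):
    """theta[k][i] = Pr(cell i | class k); returns Pr(class k | cell i) for equal priors."""
    n_cells = len(next(iter(theta.values())))
    column_sums = [sum(theta[k][i] for k in theta) for i in range(n_cells)]
    return {
        k: [theta[k][i] / column_sums[i] if column_sums[i] > 0 else 0.0
            for i in range(n_cells)]
        for k in theta
    }


def support(cell_probs, posterior_k):
    """Support for one class from one feature grid: sum_i e_i * Pr(class | cell i)."""
    return sum(e * p for e, p in zip(cell_probs, posterior_k))


def classify(test_cells, posteriors):
    """Average the per-feature supports; test_cells[f] and posteriors[f] belong to feature f."""
    m = len(posteriors)
    overall = {
        k: sum(support(test_cells[f], posteriors[f][k]) for f in range(m)) / m
        for k in posteriors[0]
    }
    return max(overall, key=overall.get), overall


# Toy example: two classes, one feature, a grid with three cells.
theta = {"one": normalise([[2, 2, 0]]), "two": normalise([[0, 2, 2]])}
posterior = cell_posteriors(theta)
best, supports = classify([[0.6, 0.4, 0.0]], [posterior])
```

With several features, one grid per feature is built for every class and the lists passed to `classify` simply grow to length m.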

2.2 Initial Findings: Tulips1 Database 

To explore the performance of rules built on the cross product space of fuzzy granules, 
an experiment on a small database (Tulips1) was carried out. The Tulips1 database 
was compiled by Movellan [15]. It was formed from 12 speakers, 9 male and 3 
female, each saying the words ‘one’, ‘two’, ‘three’ and ‘four’ twice. There are 934 
gray-scale images of 100x75 pixels taken at 30 frames per second. The audio signals 
included in the database were not taken into account for this experiment. 

Because the automatic feature extraction software relied on colour to detect lips, 
the corners of the mouth and lips were marked by hand and the changes between each 
successive frame for six features (1-6 in Section 2) were extracted manually [16]. 
Training was performed by generating rules from 11 speakers and leaving one out for 
testing. The process was repeated by including each speaker in the test set once and 
the results were averaged over all speakers. Four extended Fril rules were generated 
for each word. Rules were generated on whole words, so no segmentation of the vis- 
ual speech data was required. The results obtained for different numbers of fuzzy sets 
are shown in Table 1. 



Table 1. Various results for the Tulips1 database using extended Fril rules 

Number of fuzzy sets on each feature    7        7        6 
Number of fuzzy sets on time            2        3        3 
Average prediction accuracy             91.7%    91.7%    92.7% 



Since this database was prepared from words spoken by several speakers, the re- 
sults point to the generalising capabilities of the Fril extended rules. 
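
The leave-one-speaker-out protocol used for this experiment can be sketched as follows. The `train` and `predict` callables are hypothetical placeholders for rule generation and rule evaluation, and the toy data only demonstrates the control flow.

```python
def leave_one_speaker_out(samples, train, predict):
    """Mean accuracy over folds, each holding out all samples of one speaker.

    samples: list of (speaker, features, label) tuples.
    train:   callable taking a list of (features, label) and returning a model.
    predict: callable taking (model, features) and returning a label.
    """
    speakers = sorted({speaker for speaker, _, _ in samples})
    accuracies = []
    for held_out in speakers:
        train_set = [(f, y) for s, f, y in samples if s != held_out]
        test_set = [(f, y) for s, f, y in samples if s == held_out]
        model = train(train_set)
        correct = sum(predict(model, f) == y for f, y in test_set)
        accuracies.append(correct / len(test_set))
    return sum(accuracies) / len(accuracies)


# Toy data: the "model" simply memorises feature -> label.
samples = [(speaker, digit, word)
           for speaker in ("s1", "s2", "s3")
           for digit, word in ((1, "one"), (2, "two"))]
mean_accuracy = leave_one_speaker_out(
    samples,
    train=lambda data: dict(data),
    predict=lambda model, features: model.get(features),
)
```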



Table 2. Comparison of different methods for the Tulips1 database 

Method                                  Average Accuracy 
Extended Fril rules                     92.7% 
Diffusion Networks [17]                 91.7% 
Hidden Markov Models [18]               90.6% 
Humans without lip-reading knowledge    89.93% 
Humans with lip-reading knowledge       95.49% 



The results obtained from extended Fril rules are comparable to, and even slightly 
better than, the results obtained by other methods applied to the Tulips1 database (see 
Table 2). Movellan and Mineiro [17] achieved an accuracy of 91.7% by training diffusion 
networks (a stochastic version of recurrent neural networks). Luettin and 
Thacker [18] attained an accuracy of 90.6% by training Hidden Markov Models on 
the 5 most discriminating features representing shape and intensity and their delta 
parameters. When presented with sequences from the database, humans with no lip-reading 
knowledge achieved an average of 89.93%, while hearing impaired people 
with knowledge of lip-reading obtained 95.49% accuracy [15]. Tables 1 and 2 clearly 
illustrate the effectiveness of the extended Fril rules. The rules are speaker independent 
and hence robust enough to handle different speakers. 

3 A Larger Database 



Encouraged by the results on the Tulips1 database, a visual speech database was developed 
at the University of Bristol. The lip movement of a phoneme varies depending 
upon the phonemes that follow or precede it. Therefore, when training the system 
on various sounds, it is important to provide it with a sample of such sounds that occur 
in different contexts within a word. Keeping this point in view, a medium sized 
vocabulary of words was developed with a well balanced phonetic content [10]. There 
were 302 distinct words in the database. 

To form the visual database, the words in the vocabulary were spoken by a female 
speaker without a strong accent in English. The video was taken in a well lit room. 
The camera was focused only on the mouth of the speaker and the video frame rate 
was 25 frames per second. The length of the video sequences for each word ranged 
from 11 to 37 frames. There were around 6000 colour images, 250x160 pixels in 
size, occupying about 720 Mbytes of disk space. 

After the formation of consonant, vowel and diphthong viseme groups, the classification 
of phonemes into their corresponding viseme groups has to be explored. For 
this purpose the entire database of words was segmented manually on viseme bounda- 
ries. The dataset of 302 words was split differently into 4 datasets, each set having 
151 distinct words [10]. Training was performed on 151 words and 151 unseen words 
were used in the test set. The results obtained from these 4 sets for the test set are 
summarised in Table 3. 



Table 3. Average accuracy (percentage) of classifying visemes as phonemes for the test set 

              Data Set 1   Data Set 2   Data Set 3   Data Set 4   Average 
Consonants    61.82        63.72        63.28        61.39        62.54 
Vowels        78.5         77.72        81.22        71.96        77.3 
Total         67.46        68.54        69.35        65.03        67.58 



4 Word Recognition 

For word recognition, it is necessary to have an automatic scheme for determining the 
boundary of phonemes within a word. Work reported here has been based on manual 
segmentation. An investigation into automatic segmentation is reported in [10], where 
comparable results were obtained in the best case. 

Additionally, a scheme has to be devised for storing the vocabulary of words in a 
database. For this purpose all the words in the test vocabulary must be represented in 
an efficient way, enabling the system to perform a quick search for the most likely 
uttered word. One possibility is to take a test sequence and compare the probability of 
its occurrence with all the utterances in the test set (e.g. [5]). This might be a good 
scheme for small vocabularies but for larger test sets it would be slow and inefficient. 
An alternative is to represent the words as a tree-like structure, in which each branch 
represents a viseme group. The leaf nodes point to a possible set of words that are 
found by taking a particular path. Words that are common to one node have the same 
characteristic lip movement. 

The tree structure enables a quick search for the possible likely word. When a test 
sequence is encountered, the system takes the first branch and finds the first likely 
boundary in the sequence for the relevant viseme class by the segmentation algo- 
rithms. Then using the corresponding boundary the system determines a support for 
that branch. The branches with low supports are abandoned and those with high sup- 
ports are followed until the terminal node is reached. At the terminal node, all the 
supports are averaged to find an overall support for the word. Since the phoneme at 
the beginning of a word has a more prominent lip movement than at the end, a 
weighted average of the supports at the various branches should be taken into account. 
For this work, a heuristic measure was used to assign the weights to each 
branch. The weights were assigned as $w_i = 100 - 10i$, where $i$ represents the depth of a 
branch. Hence the overall support for a word or a set of words at a terminal node is: 

$$s_{word} = \frac{\sum_i w_i s_i}{\sum_i w_i}$$

The structure for the representation of the vocabulary of words not only enables 
quick and efficient searching for a likely set of words but also has other advantages. It 
automatically embeds the context information of the occurrence of a viseme with 
respect to its neighbours in a word. For example the occurrence of a p/b/m viseme 
group is highly unlikely after an f/v group. Therefore when a p/b/m group is found the 
system only goes into the relevant branches which reduces the search space by a con- 
siderable amount and takes the context information into account. When selecting only 
one word out of a selection of words, the accuracy of the system was 21%, and when 
selecting a group of words, the accuracy of the system was 33%. 
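
A minimal sketch of the vocabulary tree and the weighted support may help to make the search concrete. Several details here are assumptions: depth is counted from 1 for the first viseme, words are assumed to have fewer than ten visemes so that all weights stay positive, and `branch_support`, `threshold`, the viseme labels "a" and "t/d" and the support values are hypothetical stand-ins for the output of the segmentation and rule evaluation steps.

```python
def weighted_support(supports):
    """Weighted average with w_i = 100 - 10*i, depth i counted from 1 (assumed)."""
    weights = [100 - 10 * depth for depth in range(1, len(supports) + 1)]
    return sum(w * s for w, s in zip(weights, supports)) / sum(weights)


def build_tree(vocabulary):
    """vocabulary maps each word to its sequence of viseme-group labels."""
    root = {"children": {}, "words": []}
    for word, visemes in vocabulary.items():
        node = root
        for viseme in visemes:
            node = node["children"].setdefault(viseme, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def search(node, branch_support, threshold, supports=()):
    """Follow only branches whose support reaches `threshold`; return scored word sets."""
    results = []
    if node["words"] and supports:
        results.append((weighted_support(supports), sorted(node["words"])))
    depth = len(supports) + 1
    for viseme, child in node["children"].items():
        s = branch_support(viseme, depth)
        if s >= threshold:                      # low-support branches are abandoned
            results.extend(search(child, branch_support, threshold, supports + (s,)))
    return sorted(results, reverse=True)


vocabulary = {
    "pat": ["p/b/m", "a", "t/d"],
    "bat": ["p/b/m", "a", "t/d"],
    "fat": ["f/v", "a", "t/d"],
}
scores = {("p/b/m", 1): 0.9, ("f/v", 1): 0.2, ("a", 2): 0.8, ("t/d", 3): 0.6}
tree = build_tree(vocabulary)
candidates = search(tree, lambda viseme, depth: scores.get((viseme, depth), 0.0), 0.5)
```

Words that share a path, such as "pat" and "bat" here, end up at the same terminal node and are returned as one group, which corresponds to the "group of words" result reported above.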

The reader should be reminded again that a high accuracy in visual speech should 
not be expected because visual data carries only partial information regarding speech. 
Expert human lip-readers achieve an estimated accuracy of 30% on nonsense words 
[19] and if an average lip-reader is presented with a series of syllables with conso- 
nants followed by a vowel, the accuracy is about 25% [20]. 



5 Related Work 

Approaches to this problem in the past can be broadly classified as image-based or 
model-based [18]. In a model-based approach explicit features of the mouth are ex- 
tracted from video images and used for recognition. Generally, exact measurements of 
the lip area are derived from the visual database. In an image based approach the 
entire raw or processed image is fed as input into a recogniser. Model based ap- 
proaches are resistant to rotation or scaling of images, and also the lighting conditions 
under which the video of a speaker has been taken. However, some information is lost 
in feature extraction and clearly performance is dependent on selection of the “cor- 
rect” features. It is also a difficult task to build up an accurate and robust system for 
accurately measuring features from a set of visual images. 

On the other hand, image based approaches have an advantage over model based 
approaches in that they do not discard any information. The capability of analysing 
the changes in displacement of the various facial cues such as skin and wrinkles can 
be embedded in such systems. A drawback with this method is that it leads to a huge 
dimensionality of the feature vector which may have some redundant information. 

The first major work on automatic lip-reading was done by Petajan in 1984 [21]. 
He used an approach where contour information of the oral cavity from image se- 
quences was used to perform speech recognition, with linear time warping to perform 
template matching in this approach. Petajan’s system was further extended by Gold- 
schen [19] who developed an impressive visual-only continuous speech recognition 
system. His system achieved an accuracy of 25% on sentences. 

Finn and Montgomery [22] studied optically based speech recognition for consonants 
of the English language by gathering data from a male speaker who had 12 
highly reflective dots placed on his face. Mase and Pentland [23] adopted optical flow 
techniques to lip-reading. Yuhas et al [24, 25] used an image based approach towards 
lip-reading and fed an entire image into a neural network for the recognition of speech 
sounds. Movellan [15] used a technique based on diffusion networks to perform vis- 
ual speech recognition. Silsbee and Bovik [6, 26] extracted features from visual 
speech signals by using vector quantisation. Different mouth configurations, defined 
by 17 code vectors, were selected by hand. The work done by Luettin and Thacker 
[18, 27] involved learning patterns of shape variability for tracking lips in gray scale 
video images. The extracted features were modelled by Gaussian distributions and 
their temporal dependencies by Hidden Markov Models. 

In the past, researchers have developed optical speech recognition systems. However, 
these systems have only been tested on small vocabularies because of the limitations 
imposed by the bulk of video data. Systems developed in the past that make use 
of a big vocabulary system containing whole words are those developed by [19] and 
[6, 26]. In contrast to most previous work, the work presented in this paper was not 
only tested on smaller databases but also on a bigger vocabulary of 310 isolated 
words. 

6 Summary 

In this paper various methods involved in automatic recognition of isolated words in 
visual speech have been discussed. Novel methods for representing speech data and 
its classification have been illustrated. The results of this work compare well to published 
work with accuracy in the range 20-25% and show the feasibility of the application 
of fuzzy set theory to visual speech recognition. The accuracy achieved by 
manual segmentation is higher than the accuracy attained through automatic segmen- 
tation. The results from automatic segmentation are poorer because of the difficulty in 
distinguishing between the boundaries present between a consonant and a vowel at 
the end of a word. Generally this transition involves a very subtle change in feature 
values which is hard to detect. There is a need for improvement of segmentation algo- 
rithms which can detect such changes. 

Abstract. This paper presents results, at an early stage of research work, of the 
use of fuzzy decision trees in a multimedia framework. We present the discov- 
ery of rules in three different indexing scenarios. These rules represent knowl- 
edge that can be interpreted as guidelines for the development of better index- 
ing tools. We use a fuzzy decision tree algorithm to extract these rules (just) 
from color proportions of key-frames extracted from one video-news broadcast. 
Experimental results and comparisons with other data mining tools are pre- 
sented. 



1 Introduction 

On the one hand, the growth of video data has created a need to analyze and exploit it. 
Signs of this growth are the availability of video news on the web and the appearance 
on the market of video recorders equipped with hard drives. Faced with this overwhelming 
quantity, users tend to interact with the data, e.g. in order to find what they want. 
But indexing is needed in order to respond to user requests. Unfortunately, today’s 
indexing is generally done manually. The growth of video data, together with the 
requirement of new applications for finer-grained access, calls for an automation of the 
indexing process. 

On the other hand, in recent years, fuzzy data mining has introduced new methodologies 
to extract and discover fuzzy knowledge from either classical or fuzzy data 
repositories. It leads to an improvement of the knowledge of the domain from which 
the data is obtained. The advantage of using fuzzy algorithms is that they enable us not 
only to present the discovered knowledge in a more comprehensible way, but also to 
handle uncertain and/or fuzzy data (as well as traditional numerical or symbolic 
data) [1]. 

Thus, it appears natural and promising to link fuzzy data mining with multimedia 
data to obtain robust fuzzy multimedia mining. For instance, in this paper the mined 
rules are knowledge that can be used to improve the indexing process (i.e. helping the 
development of better indexing tools). 

Besides the problem of quantity, dealing with multimedia introduces a new 
difficulty related to the polymorphism of the data [2]. In fact the information that can 
be extracted from, for instance, a video (or a website) includes texts, images, sounds, 
temporal data, metadata, etc. A solution is to use a flexible and automated data-mining 
tool, which will induce knowledge from all kinds of data. A particular instance of 
such tools is the fuzzy decision tree based algorithm [1]. 

In this paper we present results, at an early stage of research work, of the use of 
fuzzy decision trees in a multimedia framework. Here we focus on the mining of 
color proportions of key-frames extracted from one single video-news. This simple 
approach allows us to clearly understand (interpret) the results and identify potential 
difficulties. However, the presented approach is designed in such a way that it can be 
directly applied on more complex data, as for instance cross-media indexes. 

In Section 2, we briefly provide some references for our data-mining tool: the 
fuzzy decision tree algorithm and software. In Section 3 we illustrate the knowledge 
discovery process. We expound the discovery of rules for three different indexing 
scenarios. These rules represent knowledge that can be interpreted as guidelines 
for the development of better indexing tools. Experimental results, comparisons with 
other data mining tools and limitations of the decision tree approach are also presented 
in this section. Finally we conclude with a short discussion about the obtained results. 



2 Fuzzy Decision Trees 

Knowledge Discovery from Data (KDD) was introduced at the beginning of the nineties 
[4]. However, due to the complexity of multimedia data, Multimedia Data Mining 
(MDM) was recently proposed as a new topic of research [5]. 

In fact, in a multimedia framework, versatile data-mining tools are necessary. One 
case of such tools is the fuzzy decision tree learning algorithm, which provides rules 
that summarize and explain the data. We use the Salammbo software, which is able to 
handle typical numerical input (non fuzzy) and constructs a fuzzy decision tree 
without human intervention [6]. 

Another advantage of using fuzzy decision trees resides in the fact that they repre- 
sent in a natural and understandable way the knowledge: a fuzzy decision tree is 
equivalent to a set of fuzzy "if.. .then" rules. 

These trees are built based on an entropy measure, which expresses a certain degree 
of order. In other words we can automatically discover which features are the 
most important (discriminant) and which values are to be considered for these 
features. 

In Figure 1 we can see an example of the output of the Salammbo software for 
mining based on colors for the detection of inlays (see Section 3.1). The rules are 
self-explanatory even though the decisions "less than" and "greater than" are fuzzy 
(see Figure 3). 
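
To illustrate how a fuzzy "less than" test and an entropy measure can work together, here is a minimal sketch. It is not the Salammbo algorithm: the piecewise-linear membership shape, the use of Shannon entropy over membership-weighted counts, and the function names and toy values are all assumptions made for the example.

```python
import math


def fuzzy_less_than(value, lower, upper):
    """Degree to which `value` is 'less than' a fuzzy threshold.

    1 below `lower`, 0 above `upper`, linear in between (assumed shape).
    """
    if value <= lower:
        return 1.0
    if value >= upper:
        return 0.0
    return (upper - value) / (upper - lower)


def class_entropy(weights):
    """Shannon entropy of a class distribution given as membership-weighted counts."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return -sum(w / total * math.log2(w / total) for w in weights.values() if w > 0)


def split_entropy(examples, lower, upper):
    """Weighted class entropy after a fuzzy 'less than' / 'greater than' test.

    examples: list of (attribute value, class label).
    """
    branches = {"less": {}, "greater": {}}
    for value, label in examples:
        mu = fuzzy_less_than(value, lower, upper)
        branches["less"][label] = branches["less"].get(label, 0.0) + mu
        branches["greater"][label] = branches["greater"].get(label, 0.0) + 1.0 - mu
    n = len(examples)
    return sum(sum(w.values()) / n * class_entropy(w) for w in branches.values())


# Toy attribute values (e.g. a color percentage) with class labels.
examples = [(1.0, "without"), (2.0, "without"), (6.0, "with"), (8.0, "with")]
good_split = split_entropy(examples, 3.0, 5.0)   # separates the two classes
poor_split = split_entropy(examples, 7.0, 9.0)   # mixes them in the 'less' branch
```

A tree builder of this kind would keep the attribute and threshold with the lowest split entropy at each node.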



3 Discovering Indexing Rules 

In order to show the potential of using fuzzy trees (and more generally any data mining 
tool), we restrict our research to a well-known feature: the color. 

PercentOfWhite less than 10.58 
    PercentOfSeagreen less than 0.08 
        PercentOfWhite less than 2.75 
            Key-frame without inlays 
        PercentOfWhite greater than 2.75 
            PercentOfLightpink less than 0.40 
            PercentOfLightpink greater than 0.40 
                Key-frame with inlays 
    PercentOfSeagreen greater than 0.08 
        PercentOfWhite less than 9.86 
        PercentOfWhite greater than 9.86 
            Key-frame with inlays 

Fig. 1. Example of rule extracted by Salammbo 



We start from a set of key-frames extracted (per shot) from a single news broadcast 
(video) [7]. In a second step, the set of colors of each key-frame is vectorized and 
"projected" onto a given reference palette (for instance a palette of 64 or 256 colors 
obtained by sampling the RGB space uniformly). This gives us a common basis to 
compare the key-frames. Then for each key-frame a histogram of frequencies (of the 
colors) is computed. This provides us with a vector in a reference space (defined by 
the colors of the reference palette) for each key-frame (see Figure 2). 
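
The vectorization step can be sketched as follows. The text does not specify how colors are projected onto the palette, so nearest-neighbour matching in RGB is assumed here; the function names and the toy pixels are hypothetical.

```python
def reference_palette(levels_per_channel):
    """Evenly sample the RGB cube; 4 levels per channel give a 64-color palette."""
    step = 255 / (levels_per_channel - 1)
    levels = [round(i * step) for i in range(levels_per_channel)]
    return [(r, g, b) for r in levels for g in levels for b in levels]


def nearest_color(pixel, palette):
    """Index of the palette color closest to `pixel` (squared Euclidean distance in RGB)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, palette[i])),
    )


def color_proportions(pixels, palette):
    """Percentage of pixels projected onto each palette color."""
    counts = [0] * len(palette)
    for pixel in pixels:
        counts[nearest_color(pixel, palette)] += 1
    return [100.0 * c / len(pixels) for c in counts]


palette = reference_palette(4)
# Toy "key-frame": three near-white pixels and one near-black pixel.
pixels = [(250, 250, 250)] * 3 + [(10, 5, 0)]
vector = color_proportions(pixels, palette)
```

Each component of `vector` plays the role of an attribute such as PercentOfWhite in the rules of Figure 1.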

So, from each key-frame we obtain a vector, which is then considered as a training 
example. Thus, from a set of classified key-frames (examples), a training set can be 
composed. Finally we build the fuzzy decision tree using the Salammbo software. 

In this paper we consider three different mining problems, all related to the extraction 
of knowledge about the general structure of the video 
news (macro-segmentation). 



3.1 Discovering the Presence of Inlays 

Inlays that appear on the TV screen are very often hints for the structure of the video 
news. They also appear either when a new person is presented or when a report ends. 
They usually consist of a square or a rectangle that frames some text (e.g. name of the 
journalist, name of the place, etc). 

We have conducted several mining experiments in order to determine if colors are 
relevant (discriminant) for the detection of inlays. We composed a training set with 
176 vectorized key-frames of one single video-news broadcast; each vector was assigned 
the class (type) of its key-frame: either with or without inlays. 

A first experiment was conducted with the whole training set and based on a reference 
palette of 64 colors. The resulting fuzzy decision tree is not very deep and has a 
root node that tests the presence of the white color (white is the major 
background color of inlay key-frames in the training video-news broadcast). We 
notice that the accuracy, recall and precision of this "root-rule" are extremely high 
(for details refer to [3]). This result points out that only a few colors are 
needed to discriminate the presence of inlay key-frames. More generally, we 
confirm the empirical observation that colors can be used for discriminating 
the appearance of inlays in a single video. 

Fig. 2. Color histogram extraction (not actual key-frames) 

This rule confirms the intuition and seems to be trivial. But if we look closely at 
the fuzzy sets built by the system, we notice that the proportion of the main color of 
the inlay has to be inside a fuzzy range. In other words the system not only tells us 
what the rule is, but also that a certain range has to be respected. In fact, it was clear 
that the presence of a large proportion of one single color is a hint of an inlay, but too 
much of that single color means that there is actually no inlay. In our case the 
percentage of white had to be greater than 2.75% and less than (fuzzy membership) 
12.03% (see Figure 3). 

Fig. 3. Membership function describing low and large percentage of white color 

Another interesting observation is that other colors (other than white) are used to 
determine the presence of the inlays. By studying this carefully, we found out that 
this is due to a bad projection (or detection) of the colors to the reference palette. This 
kind of problem is common in any visual indexing system. Here, using the fuzzy 
decision tree, we not only discover potential problems, but also the system provides a 
solution: to use the "wrong projected" colors. 

Notice that the fuzziness we are dealing with in these experiments is only related 
to the proportions of colors. But it is clear that further research should focus on the 
uncertainty related to the projection of the colors. 

In order to compare the Salammbo algorithm with regard to other learning algo- 
rithms, we conducted a second experiment (see Table 1). For the other algorithms we 
used the free software Weka [8]. We remark that here, the recall and precision values 
of the model are as important as the accuracy of the model: it is important to perfectly 
recognize at least one kind of key-frames (with or without inlays in this case). 
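
For reference, the per-class recall and precision and the overall accuracy can be computed as in the following sketch. The labels and predictions are toy values chosen for illustration and are not results from the experiments.

```python
def per_class_metrics(actual, predicted, positive):
    """Recall and precision for the class `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def accuracy(actual, predicted):
    """Fraction of examples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


actual = ["with"] * 4 + ["without"] * 4
predicted = ["with", "with", "with", "without",
             "with", "without", "without", "without"]
recall_with, precision_with = per_class_metrics(actual, predicted, "with")
overall = accuracy(actual, predicted)
```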

We can observe that our fuzzy decision tree method is not only among the methods 
with the highest accuracy, but also presents high recall and precision rates for 
inlay recognition. Moreover, the construction of fuzzy decision trees 
by Salammbo is also among the least time-consuming methods, an important property 
for multimedia applications. Thus, in addition to the understandability of the 
fuzzy decision tree model, Salammbo offers a better trade-off between accuracy and 
speed than any other of the tested methods. 



Table 1. Results for inlays recognition (64-colors palette) 

                                 With inlays           Without inlays 
Algorithm          Accur. (%)    Recall   Precision    Recall   Precision    Bld. Time (s) 
Salammbo (FDT)     81.3          0.88     0.78         0.75     0.86         1 
Naive Bayes        54.6          0.39     0.57         0.71     0.53         0.4 
Voted Perceptron   71.6          0.71     0.72         0.73     0.71         0.6 
Weka J48 (C4.5)    78.4          0.75     0.81         0.82     0.77         1 
Decision table     82.9          0.87     0.81         0.8      0.85         3.3 
Neural Network     80.1          0.84     0.78         0.76     0.83         322 

3.2 Discovering Errors in the Shot Detection 

Fundamental information for structuring a video is the shot detection. A shot is a 
sequence filmed by one camera without any cuts and can be considered as the basis 
for a macro-segmentation. Even though it is claimed that this information is, at to- 
day’s state of the art, easy to extract, we observed that the shot detection tools pro- 
duce in general a lot of false positives. 

Thus, we have conducted two experiments to see if, with a very naive approach based 
only on color proportions, we are able to discover when two successive key-frames 
are part of the same shot. In reality, our mining problem is a bit more 
complex than just shot detection, since we used as training base and test base only 
key-frames from errors of a shot detection tool. In this way we expect not only to test 
the fuzzy decision trees, but also to mine knowledge useful for the improvement of 
the shot detection tools. 

In order to test the trees, in the first experiment, we did not use any a priori knowl- 
edge, while in the second we injected some knowledge: the correspondence between 
colors. 

The first training set was made up of 92 learning examples separated into two groups 
(46 for each class). Each example of the first group was composed of two successive 
key-frames from two successive shots (class "different shot"). Each 
key-frame was vectorized in a 64-color palette. Finally, the training vector was built 
by merging the two vectors, giving a single training example with 128 features. 
Notice that there is no information about the relationship between colors (for instance 
color i and color i+64). The second group (class "the same shots") was also composed 
of two successive key-frames, but this time they belonged to the same shot, even though 
the shot detection tool (wrongly) considered them as coming from different ones. 




Fig. 4. Construction of training vectors for the shot error detection experiments 

By observing the great resemblance of the key-frames for each error, we expected 
to find rules translating the similarity. But, the built fuzzy decision tree highlights that 
the most discriminant approach is to detect the increase of the proportion of a particu- 
lar color: white (here the dominant color of inlays). For this experiment, the mean 
accuracy with cross-validation of the fuzzy decision trees was 78.26%. 

This rule suggests that the most common shot detection error is the appearance of 
inlays. We also learned that after a classical shot detection (usually based on similari- 
ties), the accuracy of the shot detection can be enhanced using the color dissimilarity 
between successive key-frames. 

Based on these results, we set up a second experiment in which we built in the idea 
of looking at the differences. This time the training vectors were built in a different 
way: instead of merging the two 64-feature vectors into a 128-color vector, we computed 
the difference color by color (for instance color i minus color i+64) to obtain a 64-color 
difference vector. With the exact same data set, we obtained a mean cross-validation 
accuracy of 86.9% (a relative improvement of 11% over the first experiment). The rules 
again point out that the reason for the errors is the appearance of the inlays, but this 
time the model is more accurate. This reveals one limitation of decision trees: their 
inability to combine different attributes. 
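
The two ways of building a training vector can be illustrated with a minimal sketch. It assumes that each key-frame is already quantized to palette indices; the function names and the tiny 4-color palette in the usage lines are hypothetical and serve only as an illustration, not as the implementation used in the experiments.

```python
def color_histogram(pixels, n_colors=64):
    """Proportion of pixels per palette index (pixels are palette indices)."""
    hist = [0.0] * n_colors
    for p in pixels:
        hist[p] += 1.0
    total = float(len(pixels)) or 1.0  # avoid division by zero for empty frames
    return [h / total for h in hist]


def merged_vector(hist_a, hist_b):
    """First experiment: concatenate both histograms (2 * n_colors features)."""
    return hist_a + hist_b


def difference_vector(hist_a, hist_b):
    """Second experiment: color-by-color difference (n_colors features)."""
    return [a - b for a, b in zip(hist_a, hist_b)]


# Toy usage with a 4-color palette instead of 64 colors.
frame_a = [0, 0, 1, 2]
frame_b = [0, 3, 3, 3]
ha = color_histogram(frame_a, 4)
hb = color_histogram(frame_b, 4)
merged = merged_vector(ha, hb)
diff = difference_vector(ha, hb)
```

In the merged vector the tree has to relate feature i and feature i + n_colors on its own, whereas the difference vector hands it that relation as a single attribute.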



3.3 Discovering Host, Diagram, Correspondent 

Another crucial hint about the structure of the news is to detect the appearance of the 
host. Thus an experiment was conducted in order to recognize the presence or non- 
presence of the host (anchor) in a key-frame. 

As in previous experiments, a training set of 50 learning examples was constructed. 
A first group of 25 training examples was composed of key-frames in which the host 
appears. A second group was composed of (randomly chosen) key-frames without the 
host. Each key-frame was vectorized in a 256-color palette, which was the 
best size for this experiment. 

Based on this example dataset and on the size of the face of the host in the positive 
examples, we expected to obtain a rule related to skin color (a sort of simple face 
recognition system). But the most effective fuzzy rule for recognizing whether the host 
is present is based on a color of the host's background (here, one specific blue color). 
This rule indicates that the best way to know whether we have a host is to check 
whether the scene takes place in the studio. In fact, with this rule we can differentiate 
the host from a journalist or a guest who is not in the studio. In Figure 5 we see the 
region where the color is present (dashed area). 

The accuracy of the fuzzy decision tree here is 88%. 



3.4 Discussion about the Extracted Knowledge 

For these three different problems, the extracted knowledge is in the form of three 
unexpected seminal rules. Note that we did not introduce any a priori knowledge. 

Fig. 5. Localization (dashed area) of discriminating color in host presence detection 

In order to recognize the presence of inlays, the system suggests detecting a large 
proportion of a single color, putting forward that all inlays are large forms of similar 
colors (inside one video). 

In order to detect the presence of a host, we should focus on the background, 
which amounts to determining whether the scene was taken in the studio. Even if it 
seems a paradox not to look at the host to detect his presence, this rule is more 
effective than looking at the characteristics of the host himself. It is clear that this 
rule cannot be directly used for another news broadcast with a different studio 
environment. But what is important here is that we now know that, inside one single 
broadcast, we should focus on the studio environment (background) in order to detect 
the anchor effectively. 

In order to improve the shot detection (which is naturally based on similarity), it 
is recommended to look at the differences between key-frames. More precisely, the 
rule suggests that an inlay has been detected, implying that this is actually the cause 
of the errors. 

We would like to recall here that these rules are intended as guidelines for the 
development of better indexing tools (i.e. useful knowledge), and are not directly 
applicable for indexing. The extracted knowledge always depends on the dataset used. 
Here, we worked on one single video file, and therefore the rules could only be used 
as indexing rules for the same or a similar type of video news program. If we wanted 
to use the extracted rules directly on any video news, we would have to train our fuzzy 
decision trees on a representative dataset containing all types of video news. The risk 
of such an approach is that the variety of formats will dilute the interesting knowledge 
and we will not be able to find any interesting rule. Our results tell us what the 
discriminating features are once we are working on one single type of video journal, 
and not how to know everything in any case. In order to generalize, what should be 
done instead is to check whether the same type of rules is also discriminating for other 
formats of video journals. We are currently working on this problem. 

4 Conclusion 

In this paper, we presented an example of multimedia knowledge discovery applied 
to the structuring of news in a video format. We used fuzzy decision trees, because of 
their simplicity and because of the understandability of the extracted rules. We fo- 
cused on the mining of the visual aspects and in particular the color feature of the 
key-frames. The extracted rules provided hints for better indexing in a structuring 
perspective. In fact, these rules deal with appearance of important information on the 
screen (inlays), the presence of the host, or correcting possible mistakes in the shot 
detection. 

This is a first step in the direction of a complete multimedia mining of a video (i.e. 
cross-media mining). Still, even this single-medium approach shows its potential. 
The next step is to continue the mining of visual content, for instance texture, 
and also the mining of structural content (length of a shot, type of transition). Future 
work will consider the other media (sound, text, etc.) in order to exploit their 
interaction. In particular, we will compare per-medium mining followed by fusion with 
joint mining of all media. 

Abstract. If large amounts of images are available, as is the case in 
image archives, image retrieval technologies are necessary to help the 
user to find needed information. In order to make such queries possible, 
images have to be enriched with content-based annotations. As manual 
annotation is very costly, system support is desired. This paper intro- 
duces an approach to learning image region classifiers from extracted 
color, texture, shape, and position features. In different experiments, 
three machine learning algorithms were applied for the classification of 
character shapes and regions in landscape images. 



1 Introduction 



Advances in information technologies during the last decades have made the 
storage of large amounts of data possible. Besides text this also includes images 
and videos. If many images have to be handled, as is the case in videos, image 
archives or satellite images, image retrieval technologies have to be set up to help 
the user to find relevant information. In order to make such queries possible, im- 
ages have to be enriched with content-based annotations. As manual annotation 
is very costly, system support is desired. In order to automate the annotation 
process, it is necessary to identify objects in images, to extract features, and to 
create classifiers, which later can be used for automated classification of objects. 

In this work, an approach to applying machine learning (ML) technologies for 
the automated creation of classification rules for image regions is introduced. The 
next section gives an overview of which features were used as input for the ML 
algorithms and how they can be determined by image processing techniques. The 
following section shows our approach to learning classifiers and to applying them 
to unknown image regions. This section also presents an approach to enabling 
adaptivity by establishing multiple classification schemes. Section 4 presents 
experiments on two different test sets: character and landscape images. Following 
the section about related work, this paper ends with a conclusion. 

Table 1. Color, texture, position, and shape features 

Color Features: Mean of hue / lightness / saturation, standard deviation of hue / 
lightness / saturation, length of hue / lightness / saturation interval, minimal 
lightness / saturation value, maximal lightness / saturation value, color identifier 

Texture Features: Shape of primitives, linelikeness, coarseness, regularity, 
directionality, contrast, softness 

Shape Features: Number of lakes / vertices / edges / shifts in direction / bays / 
main lines, circularity, convexity, lake factor, circularity of convex hull, direction, 
eccentricity, horizontal / vertical position, relative size, horizontal / vertical 
balance, cog shift, main bay direction 



2 Color, Texture, Position, and Shape Features 

This section presents the image processing technologies which have been applied 
for extracting features from images. Some algorithms for image segmentation 
and feature extraction have been taken from the PictureFinder (PF) system [10]. 
Additionally, feature extraction algorithms for shape features and for statistical 
color features have been developed. An overview of the extracted features is 
shown in Table 1. 



2.1 Image Analysis with the PictureFinder System 

PictureFinder is an image retrieval system for analyzing and querying images 
with certain properties in image archives. During the analysis phase, an auto- 
mated segmentation and annotation of color and texture features is performed. 
In order to find images with certain properties in the image archive, object 
recognition rules can be defined manually. The system also allows for annotat- 
ing keywords to images. It is possible to search for images that contain objects 
with some color or texture features at different areas of the image. An overview 
of the PF system can be found in [10]. 

The color analysis of the PF system performs a color segmentation of the 
image. The segmentation algorithm is applied to an image representation in the 
HLS space (hue, lightness, saturation). For that reason, RGB images have to be 
transformed first. 
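
As an illustration of this transformation, the following sketch converts one 8-bit RGB pixel to HLS with Python's standard `colorsys` module. The scaling to a hue angle in degrees and to lightness and saturation values between 0 and 255 follows the value ranges stated in this section; the function name is hypothetical, and the PF system may well use a different conversion routine.

```python
import colorsys


def rgb_to_hls_scaled(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, lightness 0..255, saturation 0..255)."""
    h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, round(l * 255), round(s * 255)


# Usage: pure red, pure blue, and white.
red = rgb_to_hls_scaled(255, 0, 0)
blue = rgb_to_hls_scaled(0, 0, 255)
white = rgb_to_hls_scaled(255, 255, 255)
```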

The color segmentation groups pixels of similar colors into regions. 
The assignment is determined by comparing the difference in hue, lightness, 
and saturation of two neighboring pixels. If the thresholds are exceeded, a new 
region is created. The color segmentation is performed by an extended 
blob-coloring algorithm [1] for color images. The results of the color analysis are the 
different color regions, represented by their bounding boxes, their centers of gravity, 
and their colors ("blue", "purpleblue", "purple", "redpurple", "red", "orangered", 
"orange", "yelloworange", "yellow", "greenyellow", "green", "bluegreen", "black", 
"darkgrey", "grey", "lightgrey", "white"). This 
information about regions might be sufficient for image retrieval tasks in image 
databases. For the creation of classification rules, additional color 
statistic features are provided in this work: 



— hue_mean, lightness_mean, saturation_mean: Mean of the hue / lightness / 
saturation value of all pixels of this region. 

— hue_stddev, lightness_stddev, saturation_stddev: Standard deviation of the hue / 
lightness / saturation value of all pixels of this region. 

— hue_range, lightness_range, saturation_range: Length of the interval containing all 
hue / lightness / saturation values of this region. 

— lightness_min, saturation_min: Minimal lightness / saturation value of this 
region. 

— lightness_max, saturation_max: Maximal lightness / saturation value of this 
region. 

The hue values are represented by a cyclic value set (angle between 0° and 
360°) where no minimum or maximum exists. In order to determine the mean of 
the hue values at first the smallest interval with all hue values of the region in 
the color circle is chosen. This interval is then used for the computation of the 
mean and interval length [13]. The different lightness and saturation values are 
represented by integers between 0 and 255. 
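
The treatment of cyclic hue values can be sketched as follows. The sketch finds the smallest arc of the color circle that contains all hue values by cutting the circle at the largest gap between neighboring values, and then computes mean and interval length relative to the start of that arc. It is only one plausible reading of the procedure described above (the details are in [13]); the function name is hypothetical and a non-empty input is assumed.

```python
def hue_statistics(hues):
    """Return (mean, interval_length) in degrees for a non-empty list of hue angles."""
    hs = sorted(h % 360.0 for h in hues)
    n = len(hs)
    # Find the largest gap between neighboring hues on the circle; the
    # smallest interval containing all hues starts right after that gap.
    best_gap, start = -1.0, 0
    for i in range(n):
        if i + 1 < n:
            gap, nxt = hs[i + 1] - hs[i], i + 1
        else:
            gap, nxt = hs[0] + 360.0 - hs[i], 0  # wrap-around gap
        if gap > best_gap:
            best_gap, start = gap, nxt
    # Offsets of all hues from the start of the interval, measured clockwise.
    offsets = [(h - hs[start]) % 360.0 for h in hs]
    mean = (hs[start] + sum(offsets) / n) % 360.0
    return mean, 360.0 - best_gap


# Usage: two reddish hues on both sides of the 0 degree wrap-around.
mean_hue, hue_range = hue_statistics([350.0, 10.0])
```

A plain arithmetic mean of 350° and 10° would give 180°, on the opposite side of the color circle, which is why the interval has to be chosen first.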

Besides the color analysis, the PF system provides a texture analysis component. 
For the texture analysis, a segmentation into texture regions is performed 
by applying region-based and edge-based methods [8]. From the different regions, 
samples are taken and used for the extraction of texture features [9]. The 
texture features are: shape of primitives (possible values: homogeneous, multi-areas, 
blob-like), linelikeness, coarseness, regularity, directionality, contrast, softness 
(possible values are {very low, low, medium, high, very high} represented by 
{0.0, 0.25, 0.5, 0.75, 1.0}). 



2.2 Shape Analysis 

This subsection presents a component for the extraction of shape features. The 
shape analysis assumes that an image has already been divided into regions 
and takes object images as input. An object image is a black and white image 
where all pixels of the object are identified by the value one, and all other 
pixels have the value zero. This region image is transformed in further steps into 
other representations: object contour, polyline, and convex hull. In this section 
we describe which features are extracted during which steps. A more detailed 
description of the shape analysis can be found in [13]. 

The object image can be used to compute the position of the region, its 
relative size, and the number of lakes (holes) in the object. The position of the 
object is computed from the scaled moments of order 0 (m_s(0,0)) and order 
1 (m_s(1,0) and m_s(0,1); Eq. 1, cf. [16]). The relative size is the ratio of the 
region size to the image size (Eq. 2). 



$$F_{horPos} = \frac{m_s(1,0)}{m_s(0,0)} \quad \text{and} \quad F_{verPos} = \frac{m_s(0,1)}{m_s(0,0)} \tag{1}$$

$$F_{relSize} = \frac{m_s(0,0)}{imageSize} \tag{2}$$
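
A minimal sketch of the position and relative size computation on an object image is given below. It assumes that the object image is a list of rows with entries 0 and 1 and uses plain (unscaled) moments, so the resulting positions are pixel coordinates; any additional scaling applied by the scaled moments m_s is not reproduced here, and the function names are hypothetical.

```python
def moment(mask, p, q):
    """Moment of order (p, q) of a binary object image given as a list of rows."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(mask)
               for x, v in enumerate(row))


def position_and_size(mask):
    """Return (horizontal position, vertical position, relative size)."""
    m00 = moment(mask, 0, 0)              # number of object pixels
    hor_pos = moment(mask, 1, 0) / m00    # x coordinate of the center of gravity
    ver_pos = moment(mask, 0, 1) / m00    # y coordinate of the center of gravity
    image_size = len(mask) * len(mask[0])
    return hor_pos, ver_pos, m00 / image_size


# Usage: a 2x2 object inside a 3x4 image.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0]]
features = position_and_size(mask)
```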



For the computation of the number of lakes, a region growing algorithm like 
the blob-coloring approach is applied [1]. All regions are counted which are not 
part of the object, i.e., where the value is zero, and which are not connected to 
the border of the image. By this approach, only regions are counted which are 
inside the object contour but not part of the object itself. This feature is called 
F_lakes with F_lakes ∈ ℕ₀. 

Fig. 1. Edges, vertices, shifts in direction, bays and lakes of an object 

Fig. 2. Bounding box and center lines of an object for the balance computations 
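
The lake counting described above can be sketched with a simple breadth-first region growing over the zero-valued pixels. This is an illustrative stand-in for the blob-coloring approach of [1], not the original implementation; 4-connectivity and the function name are assumptions.

```python
from collections import deque


def count_lakes(mask):
    """Count zero-valued regions of a binary object image that do not touch the border."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    lakes = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] != 0 or seen[sy][sx]:
                continue
            # Grow one background region starting from (sy, sx).
            touches_border = False
            queue = deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, h - 1) or x in (0, w - 1):
                    touches_border = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if not touches_border:
                lakes += 1
    return lakes


# Usage: a ring-shaped object with one hole.
ring = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
n_lakes = count_lakes(ring)
```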

The object image is taken as input for the creation of the object contour. 
It is processed line by line until the first pixel is found. From this starting point, 
the object contour is tracked. The chain code of the contour could easily be 
created and used for the computation of some features (e.g., perimeter) at this 
time. We just use the object contour pixels as input for the polyline computation 
in order to apply algorithms with better performance on that representation. 

The list of contour pixels with their x and y positions is used to create a 
polyline with the "Iterative End-Points Fit" algorithm [6]. The polyline 
representation is used to extract many features. Vertices can be found between two 
neighboring edges. We define the occurrence of a vertex if the angle between two 
edges is less than a threshold threshold_corner (e.g., 135°). For the computation 
of the angle, the Pythagorean theorem and trigonometric functions can be used (cf. 
[13]). The number of vertices is called F_vertices. The number of edges F_edges is 
equivalent to the number of lines in the polyline representation. The number of 
main lines, F_mainLines, counts the lines which form the main part of the contour. 
First of all, the mean length of all lines is computed and multiplied by 
an adjustable factor mainLineFactor (e.g., 0.5). All lines exceeding this value 
are main lines. Neighboring lines are merged into one line if their angle is less than 
the threshold mainLineAngleThreshold (e.g., 170°). A shift in direction appears if 
there is a shift from a right into a left curve (or the other way round). The 
directions of curves can be identified by computing the signed distance of the 
following vertex to the current line segment [13]. An example for edges, vertices, 
and shifts in direction is shown in Fig. 1. 
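
The vertex and shift-in-direction counts can be illustrated on a closed polyline given as a list of (x, y) points. In this sketch the angle is obtained from the dot product of the two edges meeting at a point, and the curve direction from the sign of a cross product, which has the same sign as the signed distance of the following vertex to the current line segment. The function names are hypothetical and the details may differ from [13].

```python
import math


def angle_at(prev, cur, nxt):
    """Angle in degrees between the two edges meeting at point cur."""
    ax, ay = prev[0] - cur[0], prev[1] - cur[1]
    bx, by = nxt[0] - cur[0], nxt[1] - cur[1]
    cos_a = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def count_vertices(polygon, threshold_corner=135.0):
    """Count the points whose edge angle is below threshold_corner."""
    n = len(polygon)
    return sum(1 for i in range(n)
               if angle_at(polygon[i - 1], polygon[i], polygon[(i + 1) % n])
               < threshold_corner)


def count_direction_shifts(polygon):
    """Count changes between left and right curves along the closed polyline."""
    n = len(polygon)
    signs = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = polygon[i - 1], polygon[i], polygon[(i + 1) % n]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross != 0:  # collinear points do not define a curve direction
            signs.append(cross > 0)
    return sum(1 for i in range(len(signs)) if signs[i] != signs[i - 1])


# Usage: an L-shaped polyline with one concave corner.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
n_vertices = count_vertices(l_shape)
n_shifts = count_direction_shifts(l_shape)
```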

The perimeter F_polygonPerimeter, the area F_polygonArea, and the circularity 
F_circularity (calculated from the two prior attributes) can easily be computed 
from the polyline representation [1,16]. 
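
A possible computation of these three features from a closed polyline is sketched below. The area uses the shoelace formula; the circularity is assumed here to be the common measure 4πA/P², which equals 1 for a circle. The exact definition used in [1,16] may differ, and the function names are hypothetical.

```python
import math


def polygon_perimeter(poly):
    """Sum of the edge lengths of a closed polyline."""
    return sum(math.hypot(poly[i][0] - poly[i - 1][0], poly[i][1] - poly[i - 1][1])
               for i in range(len(poly)))


def polygon_area(poly):
    """Area of a simple closed polyline (shoelace formula)."""
    s = sum(poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
            for i in range(len(poly)))
    return abs(s) / 2.0


def circularity(poly):
    """Assumed definition: 4 * pi * area / perimeter^2."""
    p = polygon_perimeter(poly)
    return 4.0 * math.pi * polygon_area(poly) / (p * p)


# Usage: a square with side length 4.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_circularity = circularity(square)
```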

The three moments of second order can be used to calculate the parameters of 
an ellipse with moments equivalent to the object's moments [7]. This information 
can be used to compute the lengths and directions of the principal axis and 
the minor axis of this ellipse. Steger presents an approach to compute arbitrary 
moments from the polyline representation [18]. F_direction stands for the direction¹ 
of the principal axis. Having the lengths of the axes (λ_m and λ_n), the eccentricity 
is computed by Eq. 3. 

Fig. 3. Shift of the center of gravity for a "T"-shape 

Fig. 4. Main bay direction of the letter "C" 



$$F_{eccentricity} = 1 - F_{ellipseRatio} = 1 - \frac{\lambda_n}{\lambda_m} \tag{3}$$

$$F_{convexityFactor} = \frac{F_{polygonArea}}{F_{areaConvexHull}} \tag{4}$$

$$F_{lakeFactor} = \frac{F_{lakeArea}}{F_{area} + F_{lakeArea}} \tag{5}$$



The vertical and horizontal balance describe the balance of the object areas 
on the left and right sides of the object center, and on the upper and lower 
sides, respectively. For the computation, the smallest bounding box parallel to 
the object direction is taken. Two lines g_1 and g_2 through the center are created, 
where g_1 is parallel to the object direction and g_2 is orthogonal to g_1. By counting 
the pixels to the left and right of g_1 and above and below g_2, and relating these 
values, the balances F_verBalance and F_horBalance can be calculated. F_verBalance 
and F_horBalance are in the range [0, 1], where the value 0.5 stands for complete 
balance. Fig. 2 illustrates the bounding box and g_1 and g_2 for a distorted letter 
"h". The center of gravity shift can be calculated from the features given so far. 
It is the direction of the vector from the bounding box center to the center of 
gravity (see Fig. 3). 
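
The following sketch illustrates the balance and center of gravity shift features under simplifying assumptions: the object direction is taken to be axis-aligned, so the bounding box and the two center lines are parallel to the image axes, and a balance is assumed to be the share of pixels on one side among the pixels on both sides (which yields 0.5 for complete balance). The text does not specify the exact formula, so this is only one plausible reading; all names are hypothetical.

```python
import math


def balances_and_cog_shift(mask):
    """Return (left/right balance, upper/lower balance, cog shift in degrees)."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    # Center of the axis-aligned bounding box.
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    left = sum(1 for x in xs if x < cx)
    right = sum(1 for x in xs if x > cx)
    upper = sum(1 for y in ys if y < cy)
    lower = sum(1 for y in ys if y > cy)
    left_right = left / (left + right) if left + right else 0.5
    upper_lower = upper / (upper + lower) if upper + lower else 0.5
    # Direction of the vector from the bounding box center to the center of gravity
    # (image coordinates: y grows downwards).
    gx = sum(xs) / len(xs)
    gy = sum(ys) / len(ys)
    shift = math.degrees(math.atan2(gy - cy, gx - cx)) % 360.0
    return left_right, upper_lower, shift


# Usage: a "T"-shape whose center of gravity is shifted towards the top bar.
t_shape = [[1, 1, 1, 1, 1],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0]]
lr_balance, ul_balance, cog_shift = balances_and_cog_shift(t_shape)
```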

The Quickhull algorithm is used for the convex hull computation from the 
polyline [2]. An object's convex hull, lakes, and bays are shown in Fig. 1. The 
convexity factor relates the area of the object to the area of the convex hull. 
It describes how convex an object is (Eq. 4). The circularity of the convex hull 
F_chCircularity is computed like the regular circularity of the object by using the 
area and perimeter of the convex hull representation. 
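
The convexity factor can be illustrated with a short sketch. Note that it uses Andrew's monotone chain algorithm for the convex hull instead of Quickhull, simply because it is shorter to write down; both produce the same hull. The function names are hypothetical.

```python
def convex_hull(points):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def shoelace_area(poly):
    """Area of a simple closed polyline."""
    s = sum(poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
            for i in range(len(poly)))
    return abs(s) / 2.0


def convexity_factor(polygon):
    """Ratio of the polygon area to the area of its convex hull."""
    return shoelace_area(polygon) / shoelace_area(convex_hull(polygon))


# Usage: an L-shaped polyline is less convex than a square.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
l_convexity = convexity_factor(l_shape)
```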

The lake factor describes how holey an object is by calculating the ratio of 
the area of lakes to the area of the object if it had no lakes (Eq. 5). The number 
of bays F_bays is determined by comparing the object's polyline to the polyline of 
the convex hull [13]. The main bay direction is the angle between the principal 
axis and the orthogonal line to the bay entrance. The direction of the main bay 
is discretized into eight regions as shown in Fig. 4. Instead of taking the direction 
of the bay with the biggest size, the sizes of the bays are accumulated for each 
discrete direction. The direction with the biggest sum of bay sizes is the main bay 
direction F_mainBayDirection. 

¹ For our character recognition experiments, the axes were switched if the principal 
axis was near 0°, in order to compute the upright angle of a character. 

Fig. 5. Training and test phases of the region classification system 
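
The accumulation of bay sizes per discrete direction, as described for the main bay direction, can be sketched as follows. The layout of the eight sectors (each 45° wide, with sector 0 centered on 0°) is an assumption, since the actual discretization is only given in Fig. 4; the function name and the input format are hypothetical.

```python
def main_bay_direction(bays, n_sectors=8):
    """Return the sector index with the largest accumulated bay size.

    bays is a list of (direction in degrees, size) pairs.
    """
    width = 360.0 / n_sectors
    sums = [0.0] * n_sectors
    for direction, size in bays:
        # Assumed sector layout: sector k is centered on k * width degrees.
        sector = int(((direction % 360.0) + width / 2.0) // width) % n_sectors
        sums[sector] += size
    return max(range(n_sectors), key=lambda s: sums[s])


# Usage: the single biggest bay points to 0 degrees, but two smaller bays
# around 90 degrees together outweigh it.
bays = [(0.0, 5.0), (90.0, 3.0), (95.0, 3.0)]
direction = main_bay_direction(bays)
```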



3 Machine Learning for Region Classification in Images 

In the last section, a set of color, texture, position, and shape features was 
introduced. This section describes the approach to the automated creation of 
object classifiers. The following section shows evaluation results for three ML 
algorithms on two different image sets. 

Fig. 5 shows the two main phases of the system. In the training phase, clas- 
sifiers for different classes are created from example regions by ML algorithms. 
The user can select different regions of the training images as samples and assign 
classes to them. During the testing phase, selected regions are tested with the 
different classifiers. The class with the highest confidence value will be assigned 
automatically. Now the different steps in both phases are described briefly. The 
steps up to the value mapping in Fig. 5 are identical for the two phases. 



3.1 Segmentation and Feature Extraction 

For the segmentation of input images, the existing contour, color or texture 
segmentation algorithms of the PF system can be used. It is possible to apply 
just one algorithm or a combination of different algorithms, as demonstrated 
in Fig. 6. In some domains it might not be reasonable (or even affect results 
adversely) to use more than one segmentation method. 





A Combination of Machine Learning and Image Processing Technologies 







Fig. 6. Segmentation of the input image 

The features which can be used as input for the ML programs were introduced 
in Section 2. It does not make sense to use all features in all domains, e.g., the 
color of a text is usually not important for character recognition. It is possible 
to deactivate features to avoid slowing down or confusing the ML algorithms with 
irrelevant features. 

3.2 Value Mapping 

The features span an n-dimensional space where each example can be seen as one 
point in this space. The size of the space depends on the number of attributes 
and the number of values of the attributes. The number of possible combinations 
is $\prod_{i=1}^{n} v_i$ if n features F_1, F_2, ..., F_n with v_i possible values for 
F_i exist. If ten attributes with five possible values each are used, the number of 
possible combinations is already 5^10 = 9765625. Value mappings can be applied to 
reduce the space for the ML algorithms. In our work we used three different kinds 
of mappings: 

— Discretization. Mapping of values into a set of discrete values, e.g., mapping 
[0, 1] to {very low, low, medium, high, very high}. 

— Mapping. Mapping of continuous values to symbolic values, e.g., mapping 
{0.0, 0.25, 0.5, 0.75, 1.0} to {very low, low, medium, high, very high}. This is 
necessary if an ML algorithm can only handle symbolic values. 

— Pruning. This mapping type prunes an ordered value set after a defined 
value, e.g., ℕ₀ to {0, 1, 2, 3, 4}. 
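
The three kinds of value mappings can be illustrated with a few lines of code. This is a minimal sketch under the assumption of equal-width intervals for the discretization; the function names are hypothetical and do not correspond to the actual system.

```python
LEVELS = ["very low", "low", "medium", "high", "very high"]


def discretize(value, n_intervals=5):
    """Discretization: map a value in [0, 1] to an interval index 0..n_intervals-1."""
    return min(int(value * n_intervals), n_intervals - 1)


def to_symbol(value):
    """Mapping: map one of {0.0, 0.25, 0.5, 0.75, 1.0} to its symbolic value."""
    return LEVELS[round(value * 4)]


def prune(count, max_value=4):
    """Pruning: cut an ordered value set after max_value."""
    return min(count, max_value)


# Usage: one example of each mapping type.
level = LEVELS[discretize(0.95)]
symbol = to_symbol(0.75)
pruned = prune(7)
```

With ten attributes discretized to five values each, the attribute space still has 5^10 = 9765625 possible combinations, which shows why coarse mappings matter for symbolic learners.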

3.3 Learning and Classification 

The learning input for the ML programs consists of the manually classified training 
samples. For each training sample, all active features are extracted and prepared 
for the ML programs. After the input is set up, the ML programs are applied. 
Depending on the ML approach, different classifiers are learned, e.g., concept 
descriptions, a decision tree or a neural net (NN). After classifiers have been 
learned, they can be applied to unknown regions. In the classification phase, 
classes are assigned automatically to these regions. 

Three different ML algorithms have been integrated for the creation of 
classifiers: C4.5 [17], AQ18 [12] and the backpropagation algorithm for training a 
neural net [19] using the Stuttgarter Neuronale Netze Simulator (SNNS). 
Due to lack of space, these algorithms cannot be described in detail. A 
description of these algorithms and their parameters can be found in [12,17,19]. The 
next paragraphs just introduce the rough idea of these algorithms. 

C4.5 creates decision trees from examples. The root node consists of all ex- 
amples and at each node the examples are divided depending on their values for 
a certain attribute. The selection of the attribute is motivated by information 
theory: at each step the attribute which leads to the highest information gain 
will be selected until all nodes contain only samples of one class (or until another 
stop criterion has been satisfied). 
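
The attribute selection step described above can be sketched as follows. The code computes the information gain of each attribute on a toy set of examples and picks the best one; it illustrates only this single step as described in the text and leaves out the refinements of the actual C4.5 program [17], such as pruning. The example attribute names and class labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(examples, labels, attribute):
    """Entropy reduction obtained by splitting the examples on one attribute."""
    n = len(labels)
    groups = {}
    for example, label in zip(examples, labels):
        groups.setdefault(example[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(examples, labels, attributes):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))


# Usage: "lakes" separates the two classes perfectly, "bays" not at all.
examples = [{"lakes": 0, "bays": 1}, {"lakes": 0, "bays": 0},
            {"lakes": 1, "bays": 1}, {"lakes": 1, "bays": 0}]
labels = ["x", "x", "y", "y"]
chosen = best_attribute(examples, labels, ["lakes", "bays"])
```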

AQ18 uses the AQ algorithm to create concept descriptions from samples. 
The algorithm covers the training data with rules until all examples are covered 
by at least one rule. In each step one example is taken from the set of (remaining) 
positive examples and generalized rules are created which do not cover any neg- 
ative examples. Applying a criterion function, the most preferable rule is taken 
and all covered examples are removed from the positive example set. 

The backpropagation algorithm is a standard learning algorithm for feed-forward 
networks. The initial weights are chosen randomly and are adjusted by 
the backpropagation algorithm. In each iteration, an example is selected and the 
difference between the actual output and the target output is calculated. The weights 
are adjusted to minimize the error until a threshold has been reached. In our case, 
the neural network has the features as input layer, one hidden layer with ten 
neurons, and one output neuron for each class. The weights of the net are randomly 
initialized in [0, 1] and each experiment is repeated three times. All feature 
values are mapped to the range [0, 1]. The learn factor and the number of learn 
cycles vary between the different experiments. 

3.4 Multiple Classification Schemes 

Different users often do not have the same idea of how to structure information. 
Depending on the background and the current task, the user has different needs 
while searching for information in a multimedia retrieval system. Thus, it is 
desirable to allow users to define their individual classification schemes. In some 
cases it might even be useful to define different views on the data for one person 
or a group of persons in order to support the users in performing different tasks. 

Diverse classification schemes can address different image regions of disjoint 
domains or look at the same information from varying perspectives. In the latter 
case it can be distinguished if different classification schemes just represent other 
levels of granularity or structure objects in a completely different way. 

In order to satisfy these requirements, the system must be able to deal with 
different classification schemes. In combination with user profiles and possibly 
some basic organization management functionality like the management of 
groups or roles, the different classification schemes can be assigned to different 
users and groups. While working with the system, the user can select one 
of the available views on the information sources. By selecting a certain view, 
the available concepts for retrieval tasks and training procedures are adapted to 
the current view. Depending on the domain, a selection of attributes to use for 
learning and testing can be performed. 

4 Experiments 

In two experiments we investigated the quality of ML technologies for the au- 
tomated classification of image regions. In the first experiment, classifiers are 
learned for character recognition from shape features. In the second experiment, 
color, texture, and position features are used as input for learning. In all exper- 
iments the sets of training and testing samples have been disjoint. 

Different settings for the value set mappings and configurations of the ML 
algorithms have been tested. The configurations used for AQ18 [12] are: 

— Config. 1 Intersections between rules are allowed (mode="ic"), trim to 
short rules (trim="mini"), no noise. 

— Config. 2 Intersections between rules are not allowed (mode="dc"), trim 
to short rules (trim="mini"), no noise. 

— Config. 3 Intersections between rules are allowed (mode="ic"), specialized 
rules (trim="spec"), no noise. 

— Config. 4 Intersections between rules are allowed (mode="ic"), trim to 
short rules (trim="mini"), handle noise (q_weight="0.25", rule_probe="2", 
rtol="20"). 

In all four configurations these AQ parameters were the same: maximum size of 
the star (max_star="10"), handling of ambiguous examples (ambig="max"), 
preference criterion (default: LEF="maxnew, minsel"), and testing method 
(default: INLEN mode). 

The configurations for C4.5 [17] are: 

— Config. 1 No pruning, at least 2 branches with 2 samples (m="2"). 

— Config. 2 Pruning (cf="25%"), at least 2 branches with 2 samples 
(m="2"). 

— Config. 3 Pruning (cf="10%"), at least 2 branches with 2 samples 
(m="2"). 

— Config. 4 Pruning (cf="25%"), at least 2 branches with 4 samples 
(m="4"). 

4.1 Character Recognition 

Fig. 7. Sample image for character recognition 

Fig. 8. Sample image for landscape region classification 

Our character recognition experiments do not try to compete with existing OCR 
algorithms on printed texts. They are rather suited to displayed texts in videos 
or signs in photographs, where object borders are blurred. The character 
recognition experiments have two aims: studying the different learning algorithms and 
evaluating the shape analysis component. It has to be shown whether the chosen 
features are sufficient to distinguish different classes of shapes. The features used 
in these experiments are: circularity, eccentricity, convexity, number of vertices, 
edges, shifts in direction, lakes, bays, main lines, circularity of the convex hull, 
horizontal and vertical balance, lake factor, cog shift, and main bay direction. 
For an experiment with separate classes of capital and lower case letters, an 
additional feature turned out to be helpful to overcome similarities between 
letters which look almost the same in capital and lower case form (e.g., "O"). 
It is the relative size of a letter in relation to the biggest letter in the current 
image, where F_maxArea is the area of the biggest letter: 



In these experiments three value mappings have been used: 

— Value mapping 1. Continuous value sets are mapped to five intervals. The 
cog shift values are mapped to eight intervals. The number of vertices, edges, 
main lines are pruned after the 30th value, the number of shifts in direction 
and bays after the 20th, and the number of lakes after the 10th. The main 
bay direction stays the same in all experiments (eight values). 

— Value mapping 2. Continuous value sets are mapped to 20 intervals. The 
cog shift values are mapped to 16 intervals. The other mappings are like 
value mapping 1. 

— Value mapping 3. The continuous value sets in this setting stay untouched. 
The pruned value sets are mapped as in the first two settings. 

In the first part of the experiment it is tested whether the shape features are 
sufficient to distinguish 26 (or 52) different letter shapes. Eight different images 
with 52 letters of the Arial font (normal, distorted to left or right, horizontally 
and vertically stretched, rotated to left or right, scaled down) were used as 
training and testing input. In the different runs with varied configuration seven 
images were used for training and one for testing (i.e., 12.5% of the samples). In 
the first run, capital and lower case letters were classified, in the second run only 
capital letters. Only the second value mapping was used here. C4.5 and AQ18 
were run with the different settings in all experiments; the NN was trained over 
500 learning cycles with learn factor 0.8. Table 2 summarizes the best average 
Table 2. Best accuracies of the ML algorithms in the character recognition domain

Run                              AQ18     C4.5     NN
Single font
  Capital and lower case let.    79.6%    83.9%    84.1%
  Capital let.                   88.5%    94.3%    87.5%
Different fonts
  52 classes, mapping 1          53.5%    61.1%    56.2%
  52 classes, mapping 2          62.5%    64.1%    66.0%
  52 classes, mapping 3          -        67.1%    67.0%
  26 classes, mapping 2          61.1%    69.8%    68.6%
  Capital let., mapping 2        70.2%    75.2%    80.3%
  Capital let., mapping 3        -        77.7%    76.7%
Disjoint fonts
  52 classes, mapping 1          47.5%    53.0%    51.3%
  52 classes, mapping 2          51.6%    51.2%    58.0%
  52 classes, mapping 3          -        55.4%    58.0%
  26 classes, mapping 2          52.1%    60.7%    62.5%
  Capital let., mapping 2        60.8%    65.4%    71.7%
  Capital let., mapping 3        -        66.6%    69.5%

Table 3. Best accuracies of the ML algorithms in the landscape image domain

Run                        AQ18     C4.5     NN
Unknown images
  Mapping 1                61.1%    62.1%    61.9%
  Mapping 2                53.4%    61.8%    64.3%
  Mapping 3                -        63.4%    63.0%
  Mapping 3, less attr.    -        65.2%    63.6%
Unknown regions
  Mapping 2                68.7%    71.2%    67.9%
  Mapping 3                -        75.1%    77.7%

accuracies of all character recognition experiments. In the runs with all letters,
AQ18 has its best average accuracy with the setting that yields the most specialized
rules, and C4.5 with pruning (config. 2). C4.5 with 83.9% outperforms AQ18 (79.6%).
The NN has the best result of 84.1% accuracy. If only capital letters are used, the 
accuracies are higher. C4.5 has the best result with the unpruned decision tree 
(94.3%), followed by AQ18 (config. 3 as above) with 88.5%. The NN classified 
87.5% of the test samples correctly. 

In the second part of this experiment, ten different fonts were used as training
and testing (10%, four different test sets) samples. Five of them have serifs,
the other five do not. Besides the settings of 52 classes (separate classes for
capital and lower case letters) and 26 classes (only capital letters), there was
another setting with 26 classes, where capital and lower case letters form one class
(e.g., "W" and "w" are in the same class). In these experiments the NN was
trained for 1000 cycles. The learn factor was initially set to 1.0 and reduced by 
0.1 every 100 learn cycles. In the runs with 52 classes, the best average result for 
AQ18 is 62.5% (short rules and mapping 2). C4.5 performed best with continuous 
values and pruning (config. 2) with the accuracy of 67.1%. The NN almost has the 
same accuracy here (67.0%). If the classes are put together into 26 classes, C4.5 
has an accuracy of 69.8% with the same settings. The NN also performs better 
with 68.6%. AQ18 classifies 61.1% correctly with the configuration that handles 
noise. If just capital letters are regarded, all algorithms have higher accuracies: 
80.3% with the NN (mapping 2), 77.7% with C4.5 (mapping 3, config. 3) and 
70.2% with AQ18 (mapping 2, short rules). 
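The learn factor schedule used for the NN in this part (initially 1.0, reduced by 0.1 every 100 learn cycles over 1000 cycles) can be written down as a small function. This is a minimal sketch; the function name `learn_factor`, the 0-based cycle counter, and the clamp at zero are assumptions made for illustration.

```python
def learn_factor(cycle, initial=1.0, step=0.1, every=100):
    """Learn factor at a 0-based training cycle, lowered by `step` after each block of `every` cycles."""
    reductions = cycle // every
    # Clamp at zero so the factor never becomes negative for very long runs (assumption).
    return max(initial - step * reductions, 0.0)
```

Under these assumptions the factor is 1.0 for cycles 0 to 99, 0.9 for cycles 100 to 199, and roughly 0.1 for the last block of a 1000-cycle run.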

In the third part of the character recognition experiment, the fonts of training 
and testing were completely disjoint, i.e., if some letters of a font are used as 
training, none of the remaining letters of this font is allowed to be in the test 
set (10% of samples, four different test sets). If 52 classes are distinguished, the 
best average result of the NN is with mapping 3 (58.0%). The best accuracy 
B-outhypo
  # rule
  1 [holes=2]

U-outhypo
  # rule
  1 [bays=1] [relSize2biggest=12..13]
    [cogShift=4..10]
    [mainBayDirection=2..3]

Y-outhypo
  # rule
  1 [edges=8..9] [bays=3]

Fig. 9. AQ rules for letters

holeFactor <= 0 :
|   curves <= 2 :
|   |   relSize2biggest > 9 :
|   |   |   bays <= 1 :
|   |   |   |   curves <= 1 :
|   |   |   |   |   relSize2biggest <= 13 :
|   |   |   |   |   |   mainBayDirection > 1 :
|   |   |   |   |   |   |   relSize2biggest > 11 : U
|   curves > 2 :
|   |   relSize2biggest <= 10 :
|   |   |   edges <= 9 : Y
holeFactor > 0 :
|   bays > 0 :
|   |   convexity > 16 :
|   |   |   holes > 1 : B

Fig. 10. C4.5 tree for letters

Sky-outhypo
  # rule
  1 [uniformity=homogene]
    [color=blue, green, bluegreen, grey, light grey]
    [hueMean=11..15] [lightnessStd=0]
    [saturationMean=1..8]
  2 (...)

SnowIce-outhypo
  # rule
  1 [color=bluegreen, grey, light grey]
    [hueMean=9..12] [hueStd=0..1]
    [lightnessMean=4..10] [lightnessStd=1..2]
    [verPosition=2..7]
  2 (...)

Fig. 11. AQ rules for landscape region classes


of C4.5 is 55.4% with mapping 3 and pruning (config. 3). AQ18 performs best 
with short rules and mapping 2 (51.6%). If capital and lower case letters are put 
together, higher accuracies can be achieved: 52.1% with AQ18 (handling noise), 
60.7% with C4.5 (no pruning), and 62.5% with the NN. If just capital letters are 
used, the accuracies are even higher: 71.7% for the NN, 66.6% for C4.5 (mapping 
3, no pruning), and 60.8% for AQ18 (mapping 2, handling noise). Fig. 7, 9, and 
10 show created rules, a created decision tree, and one of the input images for 
the character recognition experiments. 
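The disjoint-font condition of the third part amounts to splitting by font rather than by sample. The following is a minimal sketch, assuming each sample is a tuple whose first element names its font; the function name `split_by_font`, the tuple layout, and the seeded shuffle are illustrative assumptions, not details from the experiments.

```python
import random


def split_by_font(samples, test_fraction=0.1, seed=0):
    """Split (font, letter, features) samples so that no font occurs in both sets."""
    fonts = sorted({font for font, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(fonts)
    # Hold out whole fonts; at least one font always goes to the test set.
    n_test = max(1, round(len(fonts) * test_fraction))
    test_fonts = set(fonts[:n_test])
    train = [s for s in samples if s[0] not in test_fonts]
    test = [s for s in samples if s[0] in test_fonts]
    return train, test
```

With ten fonts and a test fraction of 10%, one complete font is held out per test set; repeating the call with different seeds gives different held-out fonts.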

4.2 Region Classification in Landscape Images 

The second experiment analyzes how the ML methods perform using the color, 
texture, and position features for the classification of regions in landscape images. 
The different classes are: water (128 samples), sky (122), cloud (135), forest (133) 
and snow (125, including ice). In some cases bushes and green spaces have also 
been assigned to the class forest in order to get more sample regions. 643 regions
from 41 images [5] were used altogether for training and testing (ca. 10% in all 
experiments). All texture features, the color identifier, vertical position, and the 
statistical color features 2 were used in these experiments. The NN was trained 
for 500 learning cycles with an initial learn factor of 1.0 (reduced every 100 cycles 
by 0.2). Three different value mappings were used: 

— Value mapping 1. The statistical values were discretized into eight and 
the vertical position into five intervals. The values for the texture features 
and the color identifier were not changed in all three mappings. 

— Value mapping 2. In the second mapping, the statistical values are divided 
into 16, the vertical position into ten intervals. 

— Value mapping 3. In this setting the continuous values are not changed. 

In the first run, completely unseen landscape images were used for testing. 
AQ18 has an accuracy of 61.1% with mapping 1 and handling noise. C4.5 per- 
forms a little better with 65.2% using mapping 3, no pruning and the reduced 
feature set. The NN classified 64.3% correctly with mapping 2. 

2 In some experiments the ranges, minimal and maximal values were omitted experi- 
mentally. 
In the second run, the regions of all images were divided randomly into
training and testing regions, i.e., training and testing regions are allowed to
be from the same image, but the training and testing sets are still disjoint.
As the used landscape images were quite diverse, this experiment shows how 
advantageous it is if regions from the same image are in the training events. If 
the accuracies are much better in these experiments, the classifiers before did 
not learn general descriptions for the different concepts, but are specialized in 
the events from the training images. Indeed, here the results are better: the NN 
has the best accuracy (77.7% with mapping 3), followed by C4.5 with 75.1% 
(mapping 3, config. 3). AQ18 (handling noise) classified 68.7% correctly. Fig. 8 
and 11 show created rules and an input image for the landscape experiments. 



5 Related Work 

An older version of AQ was applied to recognize textures [14]. The "Multi-Level
Logical Template" (MLT) method by Michalski extracts local features by a sliding
operator window, which are used to learn class descriptions for textures [15].
The MLT method has been enhanced in different ongoing works, e.g., by Channic
by applying filters and incremental learning on ultrasound images [4]. The
MLT has been extended to the "Multi-Level Image Sampling and Transformation"
(MIST) method and was used for the segmentation of landscape images
and identification of blasting caps in x-ray images [15].

C4.5 was also used by Zrimec and Wyatt for object recognition under different 
lighting conditions [20] . The task here was to classify different regions in soccer 
scenes of the Sony legged robot league like goals, walls, and ground. The regions 
were characterized by different color (e.g., average red / green / blue values, 
hue and saturation), position, and shape (area, perimeter, “wiggliness” , scaled 
moments of inertia) features. By applying C4.5 more than 98% of the 631 training 
regions were correctly classified by the learned decision tree. The paper does not 
show evaluation results for regions which have not been used for training. 

Campbell et al. address the automatic interpretation of outdoor scenes [3]. 
In their work they use texture (eight isotropic Gabor filters), color, contextual 
(surrounding areas), and shape (principal modes of variation of approximating 
polygons, size, position) features to train a neural network classifier. They report 
an accuracy of 82.9% at region classification on over 3000 unseen test regions. 
The correctly classified regions cover 91.1% of all image areas. 

An empirical comparison between decision trees and NNs for the detection 
of defects in austenitic welds was performed by Jacobsen et al. [11]. In this work 
x-ray images are decomposed into regions of interests, where 36 features are 
extracted by different image processing procedures. In the experiments the NNs 
perform slightly better and use fewer attributes than the decision trees.

In our approach, different image segmentation algorithms can be applied and 
combined to create image regions. Out of these regions, different feature extrac- 
tion methods are applied to extract texture, color, position, and shape features. 
The shape analysis combines well-known existing and some new extraction methods
to create mainly easy to understand features. The system allows a flexible
handling of the different feature extraction methods by enabling the user to 
select the attributes and their value mappings for training and classification. 

6 Conclusion and Future Research 

The experiments with the variations of the Arial font pointed out that the set 
of shape features is sufficient to distinguish at least 26 shape classes (94.3% in 
the best case). Many of the created rules of the symbolic learning approaches 
are comprehensible. In the more complex experiments the accuracies were worse. 
This can be explained by the problems with fonts with serifs and high similarity 
between some letters (e.g., "l", "i" and "T"). The experiments on the landscape
images with different regions are still satisfying (up to 77.7% in the best
cases).

The NN performed best altogether, followed by C4.5 and AQ18. Regarding 
the readability it is the other way round. The rules created by AQ18 are quite 
compact and understandable. Decision trees can be hard to read if they have 
many nodes. The learned weights of a neural net give almost no information to 
the user. The choice of the ML algorithm depends on the needs of the user. If the 
rules have to be understandable, one of the symbolic learning approaches would
be preferable. If highest accuracy is more important, the NN should be used.

Support Vector Machines (SVM) have recently become popular for classifica- 
tion tasks and outperform many other classifiers. Thus, it would be interesting 
to take SVM into account in future comparisons. As in this paper the shape ana- 
lysis and classification was applied to a set of artificially created (and modified) 
characters, a real world application on video text captions should be performed. 
Future experiments can also investigate how context information (like neighbor- 
hood of a character) and a dictionary or statistical information about characters 
(e.g., in form of n-grams) could improve the classification results. Problems with 
fonts with serifs could be solved by adapting fonts, e.g., by removing serifs. 
Abstract. The motivation for this work is to develop an image retrieval system
that can discriminate between images and that can learn a user's preferences
from feedback to make retrieval more intelligent. This paper proposes a neural
network that extends prototype refinement, which retains information fed back by
users. The proposed three-layered neural network indexes an image database and
forms clusters by an unsupervised approach at a hidden layer. Given a query, the
neural system retrieves similar images by computing similarities with images in the
near clusters by a supervised approach at an output layer. To express their
preferences, users can mark images as relevant or irrelevant. With this
feedback, the proposed refinement method estimates global approximations of
the radial-basis function centers, and simultaneously adjusts the corresponding
prototypes. Experiments with the system demonstrated the effectiveness of prototype
refinement generated by the proposed neural network.



1 Introduction 

Image retrieval based on contents is an interesting and challenging problem. The
recent emergence of multimedia databases and digital libraries makes this problem all
the more important. While manual image annotations can be used to a certain extent
to help image search, such an approach suffers from the vast amount of labor required
for manual annotation and from the subjectivity of human perception. Content-based
image retrieval was proposed to overcome these difficulties.

As databases expand, more effective retrieval techniques are needed. In
the information retrieval literature it has been well established that retrieval
performance can be significantly improved by incorporating a user’s feedback as part
of the retrieval loop. Much has been written about relevance feedback in content-based
image retrieval [1, 7, 8, 9]. By using relevance feedback, the user can adjust an
existing query and change the rank ordering of returned images. However, most existing
feedback methods only take into account one query step, and the knowledge obtained
from older query steps of the same session or of other query sessions is forgotten.
If a similar query is presented a second time, the system should retain previously
learned user preferences, thereby improving the efficiency and quality of the search.
Further, when different users submit similar queries but differ in the level and depth
of knowledge they have and in what they expect, the system should return user-preferred
results to each of them. If the retrieval system adapted to individual users’ needs,
relevance feedback techniques would certainly be considered a means of supporting
personalization.

To overcome these limitations, prototype refinement has been proposed [5]. For indexing,
the system clusters images and each cluster is represented by a prototype. Users
Fig. 1. Relevance feedback gives more control over the search criteria. In the figures, q denotes a
query, $\mu^1$, $\mu^2$, and $\mu^4$ the centers of clusters (dotted lines), R a relevant result, and IR an
irrelevant result. (a) Query and 4 clusters, (b) Query refinement by modifying q to q', (c) Prototype
refinement with only relevant results, and (d) Prototype refinement with relevant and
irrelevant results
specify their particular preferences by selecting relevant images. The relevance of
these images to the query is then added into the corresponding prototypes. It is
likely that different instances of feedback from a particular user will be related, so
that over time, frequent users of the intelligent retrieval system are able to tailor the
system to their needs. Fig. 1 (a) illustrates a case where there are a query (q) and four
prototypes as centers of clusters in a two-dimensional feature space. Many existing
systems use user feedback to adjust the query toward the desired query (q'). This process
of query refinement is illustrated in Fig. 1 (b). For query q, query refinement modifies the
query utilizing user feedback. Without prototype refinement, when the user enters the
same query again, the system would need to perform query refinement again. Instead,
Fig. 1 (c) shows how the prototypes are adapted to q, so that the system can start to
search with those adapted prototypes. In this paper, we extend this prototype refinement
by incorporating irrelevant images. As relevant results update prototypes by
moving them toward the optimal target, irrelevant results may update them by moving
them away from the optimal target (compare Fig. 1 (d) with Fig. 1 (c)). This can be
achieved by negatively adding irrelevant results into clusters [13]. It is also possible to
use a weighted metric to compute the distance between features, by putting less weight on
the irrelevant results [3]. However, studies have shown that each feature of images has
different importance in ranking images and is not applied linearly.

The target of this paper is an intelligent image retrieval system using neural net- 
works. Section 2 overviews prototype refinement. In Section 3, a neural network is 
proposed for incorporating irrelevant results. Section 4 describes the image retrieval 
system using the proposed neural network. The system consists of three functions: 
indexing, retrieval, and refinement. Experimental results and conclusions are in
Sections 5 and 6, respectively.



2 Overview of Prototype Refinement 

A given set of images can be partitioned into C clusters. Each of these clusters is
represented by a prototype $\mu^c$, $c = 1, \dots, C$. Given the set of $N^c$ images in the cluster c,
the prototype $\mu^c$ can be computed by averaging the image features that belong to the
cluster. To measure the spread of a set of data around the center of the data in the
cluster, the standard deviation ($\sigma$) is used. After clustering, images may be retrieved
by computing two distances, between a query q and the prototypes $\mu^c$ and between q and
the images v in $\mu^c$, and then by sorting based on these distances. Since the scale of a
feature can change the overall contribution of that feature in the distance computation,
each feature should be rescaled so that all features contribute equally to the distance
computation. The standard deviation ($\sigma$) is used to capture such scale information
as weights. That is, a feature's importance is inversely proportional to the relative
distance between the query and the prototype. It is also important to normalize each
feature in the images to the same range to ensure that each individual feature receives
equal weight in determining the similarity between two images [13].
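The quantities described above (prototype as feature mean, standard deviation as spread, and a distance in which each feature is rescaled) can be sketched as follows. This is an illustration only: dividing each feature difference by its standard deviation is one common way to realize the rescaling and is assumed here, and the names `prototype`, `spread`, and `scaled_distance` as well as the small `eps` guard are hypothetical.

```python
import math


def prototype(features):
    """Mean feature vector of the images in one cluster."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def spread(features, center):
    """Per-feature standard deviation of the cluster around its center."""
    n = len(features)
    return [math.sqrt(sum((x - m) ** 2 for x in column) / n)
            for column, m in zip(zip(*features), center)]


def scaled_distance(q, center, sigma, eps=1e-12):
    """Euclidean distance with every feature difference divided by its standard deviation."""
    # eps avoids a division by zero for features that are constant within the cluster.
    return math.sqrt(sum(((a - b) / (s + eps)) ** 2
                         for a, b, s in zip(q, center, sigma)))
```

Retrieval would then compare the query first against each prototype and afterwards against the images of the closest clusters, sorting by the resulting distances.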
Then, prototype refinement specifies over time a set of relevant results using
information fed back by the user about the relevance of previously retrieved examples.
The system begins by initializing the current database (m = 0). Given a query q,
selection of images sharing similar characterizing elements starts by comparing q to the
set of prototypes. If some prototypes are selected based on a distance measurement as
being similar to q, only the images matching those prototypes are compared to the
query q. At each retrieval step, the user labels the retrieved images that are relevant
to q. The system then has to estimate the user relevance feedback, i.e., for each
retrieved image it has to decide its level of relevance to the query q. The relevant image
is clustered to the prototype $\mu^c$ to make it satisfy the user’s need better than the
original query.

Since the prototype ($\mu^{c_{m+1}}$) is refined by the user’s need, we refer to the process
as prototype refinement (see Algorithm I).



(1) Initialization:

\[ m = 0, \quad N^{c_0} = N^c, \quad \mu^{c_0} = \mu^c, \quad \sigma^{c_0} = \sigma^c, \]

where m = 0 means there is no relevance feedback to q.

(2) Retrieval:

  a. Calculate the weight of the cluster:

     \[ \omega^{c_m} = 1 - \exp\left( [\ldots] \right) \]

  b. Search similar prototypes using $d_\omega(q, \mu^{c_m})$, where $d_\omega(\cdot)$ means the weighted
     distance between the two vectors inside.

  c. Search relevant images in the similar prototypes $\mu^{c^*}$ using $d(q, v)$.

(3) User relevance feedback: Select relevant images from the retrieved images.

(4) Prototype refinement: Modify each new prototype and standard deviation:

\[ N^{c_{m+1}} = N^{c_m} + N_R^{c_m}, \qquad
   \mu^{c_{m+1}} = \frac{N^{c_m}\,\mu^{c_m} + \sum_{a=1}^{N_R^{c_m}} R^{a}}{N^{c_{m+1}}}, \qquad
   \text{and} \quad \sigma^{c_{m+1}} = [\ldots], \]

where $N^{c_m}$ is the number of images in the cluster c, $N_R^{c_m}$ the number of relevant
images corresponding to the cluster c, $R_i^{a}$ the i-th feature of the a-th relevant image,
and $s_i = (\sigma_i^{c_m})^2 + (\mu_i^{c_{m+1}} - \mu_i^{c_m})^2$ after the m-th feedback, respectively.

(5) m = m + 1 and go to step (2).

Algorithm I. Prototype refinement with relevant examples
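Step (4) of Algorithm I folds the relevant results into the prototype as a running mean. The sketch below covers only that mean update and the image count; the update of the standard deviation is left out because its exact form is not recoverable from the text. The function name `refine_prototype` and the list-based feature vectors are assumptions for illustration.

```python
def refine_prototype(mu, n, relevant):
    """Fold relevant feature vectors into a prototype.

    mu       -- current prototype (mean feature vector of the cluster)
    n        -- number of images currently represented by the prototype
    relevant -- feature vectors of the results the user marked as relevant
    Returns the refined prototype and the new image count.
    """
    n_new = n + len(relevant)
    # Start from the sum of all features the old mean stands for, then add the new ones.
    sums = [n * m for m in mu]
    for r in relevant:
        sums = [s + x for s, x in zip(sums, r)]
    return [s / n_new for s in sums], n_new
```

For example, a prototype (1, 2) representing two images that receives one relevant result (4, 5) moves to (2, 3) and then represents three images.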
The next time the system is used, the user starts a search using $\mu^{c_{m+1}}$, not $\mu^c$.

Through relevance feedback, each prototype can include information on what the user
preferred previously. Whenever the system performs retrieval, the system recalculates
$d(q, \mu^{c_{m+1}})$ using the refined prototypes and thus reduces the search steps.



3 Modified Adaptive Resource-Allocating Network

Neural networks have been used in many areas such as pattern recognition, computer
vision and control systems. For an information retrieval task, a neural network has to
handle a dynamic classification problem in which relevance or irrelevance varies
with the given query. One simple way is to input two images (a query q and an
image v) with all their features and to output one value which represents the
probability that v is relevant to the query q. In a supervised approach, we want the
output of the neural network to be close to 1 if all the features of the two input images
q and v are similar. If they are not similar, the output should be close to 0. In contrast,
in an unsupervised approach, the two images are mapped to the same cluster, or at least to
clusters that are close together, if they are close to each other [10]. In this paper, we
propose a neural network which combines a supervised and an unsupervised
approach to incorporate prototype refinement.



3.1 Radial-Basis Function Network 



An RBF network is a traditional hybrid supervised-unsupervised topology [2]. The
three-layer RBF network has a feedforward architecture with an input layer x, a hidden
layer y, and an output layer z. The layers contain I, C, and K nodes, respectively. Each
hidden node represents a single RBF and computes a kernel function of x, which is
usually the following Gaussian function:

\[ \phi_c(x) = [\ldots]\,\exp\left( -\frac{\|x - \mu^c\|^2}{[\ldots]} \right), \qquad c = 1, 2, \dots, C, \]

where $\mu^c$ and $\sigma^c$ are the center and the width of the cth hidden node, respectively, and
$\|\cdot\|$ is a distance measure that is generally taken to be the Euclidean norm,

\[ \|x - \mu^c\| = \sqrt{\sum_{i} (x_i - \mu_i^c)^2}, \qquad i = 1, 2, \dots, I, \]

where $x_i$ and $\mu_i^c$ are the ith features of x and $\mu^c$, respectively. Each output layer
node is given by the sigmoid function of the weighted sums of the outputs from the
hidden nodes,

\[ z_k = \mathrm{sig}\left( \sum_{c=1}^{C} \omega_{kc}\,\phi_c(x) \right), \]

where $\omega_{kc}$ is the weight between the cth hidden node and the kth output node.
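A forward pass through such a network with a single output node can be sketched as below. The exact normalization of the Gaussian is not legible in the source, so this sketch assumes the common unnormalized form with $2(\sigma^c)^2$ in the denominator; the function names `gaussian` and `rbf_output` are hypothetical.

```python
import math


def gaussian(x, center, width):
    """Gaussian RBF response; unnormalized form with 2*width^2 is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-squared_distance / (2.0 * width ** 2))


def rbf_output(x, centers, widths, weights):
    """Sigmoid of the weighted sum of the hidden node responses (one output node)."""
    hidden = [gaussian(x, c, w) for c, w in zip(centers, widths)]
    total = sum(weight * h for weight, h in zip(weights, hidden))
    return 1.0 / (1.0 + math.exp(-total))
```

An input that coincides with a center gives that hidden node a response of 1, and the output then depends only on the weights through the sigmoid.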
3.2 Adaptive Resource-Allocating Network

In 1991, Platt proposed a sequential learning technique for RBF networks [12]. The
resulting architecture was called the resource-allocating network (RAN) and was found
to be suitable for on-line modeling of non-stationary processes. Initially, the network
contains no hidden nodes. On incoming training images, based on two criteria, RAN is
either grown or the existing network parameters (the centers and weights) are adjusted
using least mean square gradient descent. The first criterion is based on the
prediction error $|z_k - t_k|$, where $t_k$ is the desired output, and states that the error in
predicting the output by the existing network architecture should be significant. The
other criterion is the novelty criterion, which states that the distance between the
observation and the winning center should be greater than a threshold. If both
criteria are satisfied, then the data is memorized and a new hidden node is added to
the network.

An adaptive version of RAN, specially designed for image analysis tasks, was 
proposed for on-line modeling with incremental growth [4]. Using centers and widths
of the hidden nodes as templates for detection and segmentation provides a guide for 
an intelligent search through the space of possible nuclei. This approach improves 
object detection, by gathering a collection of desired objects specific to the 
application; it improves segmentation by generating higher quality initial outlines 
with different classes of objects. The centers and widths are updated proportionally 
using soft competitive learning, creating a set of tight, well-separated clusters. 



3.3 Proposed Neural Network 

For image retrieval and prototype refinement, an adaptive RAN is modified (see Fig. 
2). The modified network has two input images ( q , v) and one output node z. Since 
we assume no knowledge of the images, the network starts with no hidden nodes 
(clusters), as in the original RAN. The network grows by allocating a hidden node
during training. The hidden layer of the network first groups images based on features 
and then assigns a node to each cluster. The network finds near clusters of two input 
images and computes a similarity of both. 

3.3.1 Allocating a New Hidden Node 

Training data are supplied to the neural network in the form of pairs ((q, v), t) of input
and target. If a new input q is not similar to some images v in the near clusters and the
prediction error is significantly large, a new cluster is created with q by allocating a
new hidden node, and the number of clusters is increased. The width of a new hidden
node is set proportionally to the distance from the existing nearest hidden node to the
new input. A newly allocated hidden node is set by Platt’s method:
\[ N^{C+1} = 1, \quad \mu^{C+1} = q, \quad \sigma^{C+1} = \kappa\,\|q - \mu^{nearest}\|, \quad C = C + 1, \]

where $\kappa$ is a constant overlap factor and $\mu^{nearest}$ is the existing nearest hidden node
from q. This makes new clusters more likely to match the newly-created hidden node.
Thus, after training on a collection of images, each hidden node represents a cluster
of images that are near one another in the image space. Note that q creates at most one
new node, although q can appear several times as part of the input.
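The allocation rule can be sketched as follows: a node centered on q is added only if both the prediction error and the distance to the nearest existing center exceed their thresholds, and its width is the overlap factor times that distance. The threshold parameters, the width used for the very first node (when no neighbor exists), and the name `allocate_if_novel` are assumptions for illustration.

```python
import math


def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def allocate_if_novel(q, error, centers, widths, err_min, dist_min,
                      kappa=0.5, first_width=1.0):
    """Append a hidden node centered on q if both criteria hold; return True if the network grew."""
    if centers:
        nearest = min(euclid(q, c) for c in centers)
        # Both the error criterion and the novelty criterion must be satisfied.
        if error <= err_min or nearest <= dist_min:
            return False
        width = kappa * nearest
    else:
        # Assumption: the first node has no neighbor to scale against, so use a default width.
        width = first_width
    centers.append(list(q))
    widths.append(width)
    return True
```

When the function returns False, the existing nodes would instead be adjusted by the update step described in the next subsection.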



3.3.2 Updating the Network 

If the training image q is most similar to the cluster $\mu^c$ or the prediction error is
small, the winning cluster's $\mu^c$ and $\sigma^c$ are updated by adding q. When learning
online, the immediate environment of a single weight within a multi-layer network is
highly non-stationary, due to the simultaneous adaptation of other weights, if not due
to the learning task itself. Stepwise gradient descent is probably the simplest approach
to updating:

\[ N^c = N^c + 1, \]
\[ \Delta\omega_c = 2\alpha_c\,\eta\,\phi_c(q)\,|t - z|\,z\,|1 - z|, \]
\[ \Delta\mu_i^c = 2\alpha_c\,\eta\,[\ldots]\,\phi_c(q)\,\omega_c\,|t - z|, \]
\[ \Delta\sigma_i^c = 2\alpha_c\,\eta\,[\ldots]\,\phi_c(q)\,\omega_c\,|t - z|, \]

where $\alpha_c$ is the parameter indicating the similarity between $\mu^c$ and q and is set to 1 if
$\mu^c$ is the winning cluster.

Due to the smaller steps that are taken for the adaptation process, soft competitive
learning (winner-take-most) will lead quickly to a nearby optimum, while hard
competitive learning (winner-take-all) is more likely to get stuck in well-separated
local optima. So in addition to the winner, the proposed network also updates
some other clusters depending on their similarity with q:

where $\mu^{farthest}$ is the farthest hidden node from q.



4 Application to Image Retrieval 

4.1 Image Indexing and Retrieval 

During training of the network, image databases are clustered and indexed. Features
to represent an image are extracted and input to the network. With a training image,
the network finds the nearest and the farthest clusters based on the input features.
Images in the nearest cluster are probably similar and those in the farthest
cluster dissimilar to the training image. Thus, the training image and images in the
nearest cluster are used as positive images and the training image and images in the
farthest cluster are used as negative images.

The hidden nodes of the proposed network in Section 3 are trained by an unsupervised
approach and by on-line learning. Whenever a new image is assigned to an
existing cluster, the prototype $\mu^c$, the center of the cluster, is updated by averaging the
images that belong to the cluster. If the new image is not similar to existing prototypes,
a new prototype is created with the new image. All this processing is done in
the background, creating an index of images that is used for retrieval. Note that each
cluster may contain disconnected regions.

Using the trained neural network and the indexed (clustered) image database, we 
can find the similarity between a query image q and each image v in the near clusters. 
In each step, the similarity measurements of features of both q and v are fed into the 
neural network. The output value of the network will be the similarity between the 
two images. We can sort and rank the images in the near clusters based on the output 
of neural network. 



4.2 Refinement with Retrieved Examples 

After retrieving results for the given query q, the refinement step starts by initializing m
and the parameters as in step (1) of Algorithm I. The network takes the query q and a
relevant result R as input, and 1 as the target output. If q is most similar to the cluster
including R or the prediction error is small, the winning cluster's $\mu_R^{c_m}$ and $\sigma_R^{c_m}$ are
updated by the stepwise gradient descent of Section 3.3.2:
If the result is irrelevant (IR), the target output is 0 and the same test is performed.
Since the irrelevant result needs to move away from the corresponding prototype, the
prototype is updated as:

\[ \mu_{IR}^{c_{m+1}} = \mu_{IR}^{c_m} - \Delta\mu_{IR}^{c_m}, \]
\[ \sigma_{IR}^{c_{m+1}} = \sigma_{IR}^{c_m}, \quad \text{and} \]
\[ \omega_{IR}^{m+1} = \omega_{IR}^{m} - \Delta\omega_{IR}^{m}. \]

This process is repeated as long as the user provides feedback.

5 Experimental Results 

5.1 Experimental Setup 

In the experiments, we use the color image database which contains 9908 images
used in WBIIS [6, 14]. The system uses three query methods for color: a pixel-by-pixel
comparison (Pixel), a color histogram (CQ) and a color coherence vector (CCV)
[11]. For the color image database, the system first quantizes the RGB color space
into 4096 colors and then clusters the 4096 colors into 64 representative colors. After
resizing all images in the database to the size of the given query image, the pixel-by-pixel
comparison sums the pixel differences in this clustered color space. The color histogram
and CCV are calculated over these 64 representative colors.
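The first quantization step can be illustrated with a short sketch. It assumes 8-bit RGB channels and a uniform quantization that keeps the top 4 bits of each channel, which yields 16 x 16 x 16 = 4096 colors; the text does not state how the 4096 colors are obtained, and the subsequent clustering into 64 representative colors is not shown because its method is not described. The names `quantize_4096` and `color_histogram` are hypothetical.

```python
def quantize_4096(r, g, b):
    """Map an 8-bit RGB triple to one of 4096 colors by keeping the top 4 bits per channel."""
    return ((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4)


def color_histogram(pixels, n_bins=4096):
    """Normalized histogram of quantized colors over a list of (r, g, b) pixels."""
    hist = [0.0] * n_bins
    for r, g, b in pixels:
        hist[quantize_4096(r, g, b)] += 1.0
    total = len(pixels)
    # An empty image yields an all-zero histogram instead of a division by zero.
    return [h / total for h in hist] if total else hist
```

A histogram over the 64 representative colors would be built the same way once each of the 4096 colors has been assigned to its representative.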

The proposed neural network used 128 (or 256) inputs (I) for the color histogram (or
CCV), no initial prototypes (C), and one output (K). The parameter values of the
network used for this experiment were η = 1.1 and κ = 0.5. The network was trained
with 9888 randomly selected color images. The remaining 20 color images were used
to test the neural network for retrieval and refinement. As an input pair, each training
image was submitted to the neural network with up to 5 positive images selected by
an expert in the near clusters and with 5 negative images selected randomly in the far
clusters.

The performance in the experiments is evaluated using two measures: precision 
and the number of retrieved results. Precision is the ratio of relevant images retrieved 
to the total number of images retrieved; it measures the ability of a system to present 
only relevant images. The number of retrieved results is the total number of images 
retrieved in the database. 
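
As a minimal sketch of the precision measure defined above, the following Python fragment computes the ratio for one query. The function name and the image identifiers are hypothetical and serve only as an illustration.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved images that are relevant (0.0 if nothing was retrieved)."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for image_id in retrieved if image_id in relevant)
    return hits / len(retrieved)


# Hypothetical example: 4 images retrieved, 2 of them judged relevant.
retrieved_ids = ["img03", "img17", "img21", "img40"]
relevant_ids = {"img03", "img21", "img99"}
print(precision(retrieved_ids, relevant_ids))
```

The number of retrieved results, the second measure, is simply `len(retrieved_ids)` in this notation.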

5.2 Experimental Results 

Fig. 3 shows the average results of 20 queries on the WBIIS database. Adding more
prototypes continuously improves the precision, and thus the system works more
accurately as it learns the user’s preference. On the WBIIS database, after the first
refinement, the color histogram query gets comparable or better results than the
pixel-by-pixel queries and the query with no refinement (Fig. 3(a)). The CCV query with
refinements also gets better results than with no refinement (Fig. 3(b)).

Fig. 3. Results on WBIIS database show prototype refinement generated by the proposed neural
network increases precision: (a) results of color histogram, and (b) results of CCV

Fig. 4 shows a sample result of the color histogram query (the upper image). The 
user can provide her feedback by clicking one box of relevance or irrelevance. After 
adding her feedback, the system estimates her preference and changes the order of 
images. Fig. 5 and Fig. 6 show that prototype refinement improves the retrieval 
result of the query. An important aspect of this paper is that the proposed system 
returns Fig. 6 when the user submits the same query after adding feedback and updat- 
ing prototypes. 



6 Conclusions 

This paper proposed a neural network to extend prototype refinement by updating
prototypes (hidden nodes) with both relevant and irrelevant examples. Using global
approximations in the hidden nodes, the proposed neural network updates prototypes
by moving toward the optimal target (in a relevant case) or by moving away from the
optimal target (in an irrelevant case). We are currently extending our work in several
directions. For instance, we are exploring ways to incorporate heterogeneous features
into prototype refinement. We are also studying how to find and model personal
characteristics of search through user feedback.

Fig. 4. Initial results of a red rose using color histogram and user feedback: 12 retrieved
images are ranked from left to right, top to bottom. Each image has two boxes for relevance R
and irrelevance IR, and a user can provide feedback by clicking one of them

Fig. 5. Results after adding 1st user feedback (Fig. 4): the user can provide more feedback

Fig. 6. Results after adding 2nd user feedback (Fig. 5)
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Abstract. This paper describes a music information retrieval system which uses 
humming as the key for retrieval. Humming is an easy way for the user to input 
a melody. However, there are several problems with humming that degrade the 
retrieval of information. One problem is a human factor. Sometimes people do 
not sing accurately, especially if they are inexperienced or unaccompanied. An- 
other problem arises from signal processing. Therefore, a music information re- 
trieval method should be sufficiently robust to surmount various humming er- 
rors and signal processing problems. A retrieval system has to extract pitch 
from the user’s humming. However, pitch extraction is not perfect. It often cap- 
tures half or double pitches, even if the extraction algorithms take the continuity 
of pitch into account. Considering these problems, we propose a system that 
takes multiple pitch candidates into account. In addition to the frequencies of 
the pitch candidates, the confidence measures obtained from their powers are 
taken into consideration as well. We also propose the use of a query engine 
with three dimensions that is an extension of the conventional DP algorithm, so 
that multiple pitch candidates can be treated. Moreover, in the proposed algo- 
rithm, DP paths are changed dynamically to take relative spans and pitches of 
input and reference notes into account in order to treat split or union of notes. In 
an evaluation experiment, in which the performance of a conventional system 
was compared with that of the proposed system, better retrieval results were ob- 
tained for the latter. Finally, we implemented a GUI based music information 
retrieval system. 



1 Introduction 

In conventional MIR (Music Information Retrieval) systems, retrieval keys mainly
consist of text information, such as a singer’s name, a composer, the title of a piece of
music, or the lyrics of a song [5], [9]. Several recent systems utilize music information
extracted from humming as a key for the information retrieval. The full query-by-
humming system was first proposed by Ghias et al. [2], and several such systems
have been developed, including MELDEX [14], Themefinder [15], TuneServer [16],
MiDiLiB [13], Super MBox [17], SoundCompass [18], etc. These systems use various
melody representations and matching methods. In several previous research studies,
durations (sound lengths) and pitches obtained from a user’s humming were used
as music information [1], [2], [3], [5].

A. Nürnberger and M. Detyniecki (Eds.): AMR 2003, LNCS 3094, pp. 212-227, 2004.

Various MIR systems have been reported in the literature. For example, Ghias
et al. [2] proposed an approach for modeling the content of music objects. In their
system, called Query by Humming (QBH), a music object is transformed into a string
which consists of three kinds of symbols (‘U’, ‘D’ and ‘S’, which signify that a note
is higher than, lower than or the same as the previous note, respectively). The problem
of music data retrieval is then transformed into approximate string matching.
As a result, two or more pieces of music may be retrieved for a certain humming.
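
The following Python sketch illustrates the U/D/S string representation described above and shows why different melodies can map to the same string. It is not the QBH implementation: the function names and the MIDI note numbers are hypothetical, and plain Levenshtein distance is used here only as one common form of approximate string matching.

```python
def contour(pitches):
    """Encode a pitch sequence as U/D/S symbols relative to the previous note."""
    symbols = []
    for prev, cur in zip(pitches, pitches[1:]):
        symbols.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(symbols)


def edit_distance(a, b):
    """Levenshtein distance between two contour strings."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


melody_a = [60, 62, 64, 64, 62]  # MIDI note numbers (hypothetical data)
melody_b = [48, 55, 57, 57, 50]  # different notes, same up/down shape
print(contour(melody_a), contour(melody_b))
```

Both melodies produce the same contour string, so a query with this contour cannot distinguish between them.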

The work developed in the MIT Media Lab. [7] employed HMM to model and
classify the melodies of folk music. In this work, they compared four kinds of
representation: absolute pitch representation, absolute pitch with duration
representation, interval (relative pitch) representation and contour (relative pitch
classified into 5 categories) representation. MELDEX allowed a variety of different
contour representations, such as exact interval or more than 3-level contour
information. It also attempted to incorporate rhythmic information, mainly represented
as note durations, but the corresponding matching method was not effective. For
example, absolute note durations were used for melody matching. In practice, almost
no users can hum in such a precise way, and this system attempted to identify only
the beginnings of melodies. The TuneServer project used only 3-level contour information
to represent melodies. Although no new methods were incorporated in this system, its
melody database was quite large. The MiDiLiB project allowed a variety of different
contour representations, such as exact interval or more than 3-level contour
information. Similarly, this system did not propose any new methods.

There are several important issues in building an MIR system that retrieves music
from a database using hummed tunes. One is the influence of the user’s individual
characteristics, such as differences of tonality and tempo. Furthermore, singing errors
such as split or union of notes are contained in a user’s humming [7], [8]. Another
problem is that, even if the hummed queries are perfect, it is still difficult to implement
a 100% accurate system for transcribing the hummed signals into musical symbols. To
deal with these problems, we need an effective representation of the hummed melody and
a musically reasonable matching method. Therefore, we consider the following problems:
the event detection problem, the feature extraction problem, the melody representation
problem and the melody matching problem. We believe these problems are the key points
for the realization of a robust and efficient MIR system.



214 Sung-Phil Heo et al. 



2 Overview of MIR System 

An overview of our music retrieval system is shown in Fig. 1. There are several main
components in our system: an event detection module, a feature extraction module, a
melody representation module and a similarity measurement module. First, a user
hums a melody to input the music information. The system detects notes from the
humming and then extracts the features, such as the spans and multiple pitches, along
with confidence measures. The similarity measurement engine compares the extracted
features of the humming with those of the database. The query engine carries out
matching of the humming to the database based on dynamic programming (DP), and
some of the nearest matching melodies found in the database become the results. A
ranked list of matching melodies is then displayed on the screen. The retrieval result
contains a list of ranked melodies, with a song name and the time position where
there is a match with the target tune.




3 Event Detection 

The purpose of event detection is to identify each note’s onset and offset boundaries
within the acoustic signal. It is possible to detect one interval (a note) by applying a
suitable threshold in amplitude-based segmentation [20]. However, with simple
threshold-based segmentation, two or three notes may be merged into one note, or one
note may be split into two or three notes. Consequently, event detection errors can
seriously degrade retrieval accuracy.

The note length can be acquired from the event detection process. The note length 
can be defined by either duration or span, as shown in Fig. 2. 

In this paper, duration is defined as the time from the beginning to the end of a
note. Span is defined as the time from the beginning of one note to the beginning of
the next note. In the experiment, span is used as the feature of note length, since span
shows better retrieval results than duration for a user who sings staccato. The last note
was neglected since it has no span, and according to the preliminary experiment,
better retrieval results were obtained when the last note was not used.

Fig. 2. Example of duration and span
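
To make the two definitions concrete, the short Python sketch below derives durations and spans from note boundaries. The onset and offset times (in milliseconds) are hypothetical values chosen for illustration.

```python
def durations_and_spans(onsets, offsets):
    """Duration: onset to offset of a note. Span: onset to the next note's onset."""
    durations = [off - on for on, off in zip(onsets, offsets)]
    spans = [nxt - cur for cur, nxt in zip(onsets, onsets[1:])]
    return durations, spans


# Hypothetical note boundaries in milliseconds for four hummed notes.
onsets_ms = [0, 500, 1000, 1750]
offsets_ms = [200, 700, 1600, 2000]
durations, spans = durations_and_spans(onsets_ms, offsets_ms)
print(durations)  # one value per note
print(spans)      # one value fewer: the last note has no span
```

The first two notes are sung staccato (short duration, long silence), yet their spans still reflect the rhythm of the melody, which is consistent with the motivation given above.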

For precise extraction of spans and pitches, it is necessary to segment each note
correctly. In this paper, the singing method is restricted to /ta/, /ka/, /pa/ or /cha/,
to enable highly precise event detection by observing the power difference between
analysis frames [3], [8], [20]. Because of the limited singing method, it is easy to
detect the boundary between the plosive sound /t/ and the voiced sound /a/.

For exact event detection, we employ a preprocessing step called combined filter
processing (CFP). The CFP is the combination of a band pass filter and a differential
filter. First, a band pass filter (600-1,500 Hz) is applied to the humming to avoid
detecting a respiratory sound. Then the powers of the filtered input are calculated
frame-by-frame. Finally, a differential filter is applied to the sequence of powers. The
beginning of a note in the humming is detected when the CFP output exceeds the
predefined threshold. Fig. 3 shows a flowchart of the event detection method by CFP. The
pitch frequency of the note is extracted from the center frame of the note.
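
The CFP steps above can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the band pass filter is approximated by zeroing FFT bins (the filter design is not specified in the text), the differential filter is taken to be a first-order difference, and the threshold value and the synthetic test signal are hypothetical.

```python
import numpy as np


def band_pass(signal, fs, low_hz=600.0, high_hz=1500.0):
    """Zero all FFT bins outside [low_hz, high_hz] (a crude stand-in filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def frame_powers(signal, frame_len, hop):
    """Mean squared amplitude of each analysis frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([np.mean(signal[k * hop:k * hop + frame_len] ** 2)
                     for k in range(n_frames)])


def detect_onsets(signal, fs, threshold, frame_ms=64, hop_ms=8):
    """Return onset times (s) where the CFP output first exceeds the threshold."""
    frame_len = fs * frame_ms // 1000
    hop = fs * hop_ms // 1000
    powers = frame_powers(band_pass(signal, fs), frame_len, hop)
    cfp = np.diff(powers, prepend=powers[0])    # first-order differential filter
    above = cfp > threshold
    previous = np.concatenate(([False], above[:-1]))
    rising = np.flatnonzero(above & ~previous)  # frames where the threshold is first crossed
    return (rising * hop + frame_len) / fs      # time at the end of the crossing frame


# Synthetic input (hypothetical): a 200 Hz "breath" and two 1 kHz notes.
fs = 16000
t = np.arange(2 * fs) / fs
signal = np.where(t < 0.3, np.sin(2 * np.pi * 200 * t), 0.0)
signal += np.where((t >= 0.5) & (t < 0.8), np.sin(2 * np.pi * 1000 * t), 0.0)
signal += np.where((t >= 1.2) & (t < 1.5), np.sin(2 * np.pi * 1000 * t), 0.0)
onsets = detect_onsets(signal, fs, threshold=0.01)
print(onsets)
```

In this sketch the low-frequency component is removed by the band pass stage, so only the two notes should trigger the threshold. The 64 ms window and 8 ms hop follow the experimental conditions reported later in the paper.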




[Flowchart: Band Pass Filter, Differential Filter, Interval Detection]

Fig. 3. The event detection method by CFP (Combined Filter Processing)



Fig. 4 shows a comparison of the event detection between the amplitude-based
segmentation method and the proposed CFP method. The respiratory sound in section A
is removed in section C by the band pass filter. Two or three notes which are
merged into one note in section B are segmented in section D when the differential
filter is used. Therefore, the proposed CFP method gave highly precise event detection.



4 Feature Extraction 

Pitch is a perception that is defined as the characteristic of a sound that gives the
sensation of being high or low. A pitch extraction algorithm is applied to the humming
input, but it often captures an incorrect frequency. There is no unique definition of
pitch, and numerous researchers have modified the definition as per their requirements.
Extraction of pitch is difficult and requires different strategies under different
conditions; that is why many pitch detection algorithms exist and why research
continues. In the context of music, pitch has been defined as that characteristic of a
sound which makes it sound high or low, i.e., the factor which determines its position
on a scale. For a pure tone, the pitch is determined mainly by the fundamental
frequency. Strictly speaking, fundamental frequency is a physical attribute of any
periodic or quasi-periodic signal, whereas pitch is a perceptual attribute evoked in the
auditory system.

Fig. 4. An example of the event detection (Left: conventional amplitude-based segmentation
method, Right: proposed CFP method)

To estimate the scale of the hummed note, the pitch extraction is carried out upon 
the user’s humming. The accuracy of pitch extraction greatly affects the system’s 
performance [4], [6], [20]. Therefore, we consider multiple pitch candidates to en- 
hance the performance of retrieval. In this section, a method is described that calcu- 
lates the multiple pitch candidates as well as their confidence measures. 

4.1 Extraction of Multiple Pitch Candidates 

Pitch extraction is based on cepstral analysis [4]. Fig. 5 shows the basic flowchart of 
pitch extraction. First, the power spectrum is obtained from the input signal using 
FFT. Next, the logarithm and IFFT are applied to obtain the cepstrum. Then cepstral 
peaks in the fundamental frequency’s range of existence are chosen as pitch candi- 
dates. Finally, quefrencies of the peaks are converted into pitch frequencies. Multiple 
pitch candidates (MPC) are passed to the query engine without choosing one candi- 
date in the feature extraction module. 

Next, confidence measures are calculated from the values of cepstral peaks. The 
confidence measures are calculated as the cepstral value of the peak divided by that 
of the top candidate. 
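
The extraction of multiple pitch candidates with confidence measures can be sketched in Python with NumPy as follows. This is a minimal illustration of the steps listed above, not the authors' code. The search range of 80 to 800 Hz for the fundamental frequency, the simple local-maximum peak picker and the synthetic pulse-train input are all assumptions made for the example.

```python
import numpy as np


def pitch_candidates(frame, fs, n_candidates=3, f_min=80.0, f_max=800.0):
    """Return up to n_candidates (pitch in Hz, confidence) pairs for one frame."""
    windowed = frame * np.hamming(len(frame))
    power_spectrum = np.abs(np.fft.fft(windowed)) ** 2
    # Cepstrum: inverse FFT of the log power spectrum.
    cepstrum = np.real(np.fft.ifft(np.log(power_spectrum + 1e-12)))
    q_min = max(int(fs / f_max), 1)                      # shortest period (samples)
    q_max = min(int(fs / f_min), len(frame) // 2 - 1)    # longest period (samples)
    # Cepstral peaks: local maxima inside the assumed quefrency range.
    peaks = [q for q in range(q_min, q_max)
             if cepstrum[q] > cepstrum[q - 1] and cepstrum[q] >= cepstrum[q + 1]]
    peaks.sort(key=lambda q: cepstrum[q], reverse=True)
    top = peaks[:n_candidates]
    if not top:
        return []
    best = cepstrum[top[0]]
    # Confidence: cepstral value of the peak divided by that of the top candidate.
    return [(fs / q, cepstrum[q] / best) for q in top]


# Synthetic input (hypothetical): a pulse train with an 80-sample period, i.e. 200 Hz.
fs = 16000
frame = np.zeros(1024)
frame[::80] = 1.0
candidates = pitch_candidates(frame, fs)
for freq, confidence in candidates:
    print(round(freq, 1), round(confidence, 3))
```

All candidates are returned to the caller instead of a single best value, mirroring the design choice above of deferring the decision to the query engine.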

4.2 Evaluation of Pitch Extraction 

The pitch extraction accuracy was calculated by comparing the correct pitch value
with the extraction result. The references were labeled by hand, and the accuracy of
pitch extraction was analyzed using 260 samples hummed by five subjects.

Fig. 5. The extraction flow of multiple pitch candidates and confidence measure

Fig. 6. A range of pitch extraction accuracy

Table 1. Evaluation of the pitch extraction rates

Ranks              1st rank   Within 2 ranks   Within 3 ranks   Others
Extraction ratio   88.4%      96.3%            99.7%            0.3%


The pitch extraction accuracy was calculated as in Fig. 6, by comparing the correct
pitch frequency ($f_p$) with the extraction result. An extracted pitch value is regarded
as correct when the difference between the log frequency of the correct pitch and that of
the extracted pitch is less than 20 cents. Table 1 shows the pitch extraction accuracy
when one to three pitch candidates are considered.
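
The 20-cent criterion can be illustrated with a few lines of Python. The sketch assumes the usual definition of the cent as 1200 times the base-2 logarithm of a frequency ratio; the function names and the example frequencies are hypothetical.

```python
import math


def cents(f, f_ref):
    """Distance of frequency f from f_ref in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f / f_ref)


def is_correct(extracted_hz, true_hz, tolerance_cents=20.0):
    """True when the extracted pitch lies within the tolerance of the labeled pitch."""
    return abs(cents(extracted_hz, true_hz)) < tolerance_cents


print(is_correct(221.0, 220.0))  # small deviation from the labeled pitch
print(is_correct(440.0, 220.0))  # double pitch: 1200 cents away
print(is_correct(110.0, 220.0))  # half pitch: -1200 cents away
```

Double and half pitches are a full octave (1200 cents) away from the reference, so they always fail the 20-cent test.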

In most frames with incorrect results, harmonic frequencies (double or half of the
correct pitch) were extracted. The narrower the pitch range, the easier the problem
is to solve, but in music, pitch can have leaps of one octave or more. Octave
errors are among the most common errors in MIR systems.

By using three pitch candidates, the accuracy was 99.7%. This result shows that
three pitch candidates are sufficient for the subsequent processing.

5 DP Matching and Melody Representation 

When a user inputs humming, the following things should be considered. (1) Even if
a user hums the same notes, there are variations in tempo and tonality. (2) Split and/or
union of notes occur due to the ambiguous memory of a user or to interval detection
errors in the MIR system. Here, considering (1), the variation is absorbed by taking
relative values between successive notes. With respect to (2), the variation can be
absorbed by using DP matching [10], [11], [12].

5.1 Static Melody Representation in Conventional DP



In conventional DP matching, a DP matching matrix, g(i, j), such as that illustrated in
Fig. 7, is used to obtain distance D between sequence R, with length I, and sequence
Q, with length J. The penalty between elements of Q and R is represented by d.

[Fig. 7: DP matching matrix; axis label: Database Sequence, indices 0, 1, 2, 3, ..., i, ..., I-1]

Let $R = \{i_0, i_1, i_2, \ldots, i_{I-1}\}$ be a database sequence of I notes, each of which
is encoded as a pair of pitch and span, and let $Q = \{j_0, j_1, j_2, \ldots, j_{J-1}\}$ be a
humming sequence of J notes. The local distance $d(i, j) = |i - j|$ between the elements of
the two sequences can then be accumulated recursively. Under these hypotheses, the
distance is calculated as follows.

Boundary conditions at steps I and II:

\[ g(0, j) = d(i_0, j), \quad 0 \le j \le J-1, \qquad g(i, 0) = \infty, \quad 0 \le i \le I-1 . \]

The general procedure at step III, for $i = 1, \ldots, I-1$ and $j = 1, \ldots, J-1$:

\[ g(i, j) = \min \left\{ \begin{array}{l} g(i-2, j-1) + d(i, j) \\ g(i-1, j-1) + d(i, j) \\ g(i-1, j-2) + 2d(i, j) \end{array} \right. \tag{1} \]

\[ d(i, j) = \gamma\, p(i, j) + (1 - \gamma)\, t(i, j) \tag{2} \]

The penalty d(i, j) is a weighted sum of the pitch and span distances, applied along the
local paths in Fig. 8. It is calculated between the humming and the database. Here, i and j
are note numbers, p is the distance between the pitch of the humming and that of the
database for each note, t is the distance between the span of the humming and that of the
database for each note, and $\gamma$ is a weighting value.

Here, p and t are respectively defined as follows:

\[ p(i, j) = \left| \{ m_p(i) - m_p(i-1) \} - \{ h_p(j) - h_p(j-1) \} \right| \tag{3} \]

\[ t(i, j) = \left| \{ m_t(i) - m_t(i-1) \} - \{ h_t(j) - h_t(j-1) \} \right| \tag{4} \]

where $m_p$ and $m_t$ are the pitch and span values in the database, respectively, and
$h_p$ and $h_t$ represent the pitch and span values in the humming.

The conventional melody representation methods normalize the features by calculating
pitch and span ratios relative to the successive note [2], [3], [21], as described in
Eq. (3) and Eq. (4). The relative pitch values are expressed in cents. The cent is a unit of
log frequency, $d(f) = 1200(\log_2 f - \log_2 f_0)$, which is equal to 1/100 of a semitone.
Here, $f_0$ and $f$ stand for the reference frequency and the frequency with which we are
concerned. However, there are several problems in the conventional melody representation
method. When split or union of notes occurs, the relative values obtained using
the pitch value of the successive note are changed.



For example, the note G is recognized as two Gs in the humming in Fig. 9. The 
relative pitch sequence in the database is {200 cents, 200 cents, 300 cents}, while the 
sequence obtained from the humming is {200 cents, 200 cents, 300 cents, 0 cent}. 
The distance between these sequences is 300 cents. This mismatch arises from the 
fact that the calculation of relative pitch does not take the possibility of split or union 
of notes into account. 
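
The mismatch in this example can be reproduced with a small Python sketch. The note names other than E and G, and the table of pitch values in cents, are assumptions chosen so that the intervals match the sequence quoted above (200, 200 and 300 cents).

```python
# Pitch of each note in cents above C (hypothetical melody C, D, E, G).
NOTE_CENTS = {"C": 0, "D": 200, "E": 400, "G": 700}


def relative_pitches(notes):
    """Relative pitch of each note with respect to the immediately preceding note."""
    values = [NOTE_CENTS[name] for name in notes]
    return [cur - prev for prev, cur in zip(values, values[1:])]


database = ["C", "D", "E", "G"]
humming = ["C", "D", "E", "G", "G"]   # the note G was detected as two notes

db_rel = relative_pitches(database)
hum_rel = relative_pitches(humming)
print(db_rel)    # intervals in the database
print(hum_rel)   # intervals in the humming, with a spurious final 0

# Static representation: the second G is compared with the database interval E -> G.
static_mismatch = abs(db_rel[-1] - hum_rel[-1])
# Taking the relative pitch from the second note before (E) removes the mismatch.
dynamic_value = NOTE_CENTS[humming[-1]] - NOTE_CENTS[humming[-3]]
print(static_mismatch, dynamic_value)
```

The last line previews the remedy developed in the next subsection: the reference note for the relative pitch has to depend on the split or union hypothesis.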

5.2 Dynamic Melody Representation 

When a note is matched based on the hypothesis that the previous note in the
humming is split into two notes, the relative pitch of the current note in the humming
should be considered in relation to the second note before the current note. Then the
relative pitch of the last note in the humming data will become {G-E=300 cents}
instead of {G-G=0 cent}.

This means that the relative value must be dynamically determined. In the case of
union, the relative pitches must be dynamically determined by the same method.
Therefore, Eq. (3) must be modified according to the DP paths, as in Eq. (5), because
the relative pitch values should be dynamically calculated considering the split or
union of notes.




Fig. 8. The local path constraint and weight value

Fig. 9. Relative pitch values conversion method when split occurs: (a) traditional method,
(b) proposed method



In Eq. (5), $p_e(i, j)$ with $e = 1, 2, 3$ corresponds to the local path constraints in
Fig. 8 according to the DP paths.

\[ \begin{array}{l} p_1(i, j) = \left| \{ m_p(i) - m_p(i-2) \} - \{ h_p(j) - h_p(j-1) \} \right| \\ p_2(i, j) = \left| \{ m_p(i) - m_p(i-1) \} - \{ h_p(j) - h_p(j-1) \} \right| \\ p_3(i, j) = \left| \{ m_p(i) - m_p(i-1) \} - \{ h_p(j) - h_p(j-2) \} \right| \end{array} \tag{5} \]



On the other hand, in the case of span, the problem cannot be solved merely by
changing the note against which the relative value is taken. Therefore, a different
approach is needed to treat the problem. Fig. 10 shows the case where notes m3 and m4
in the database are split or united in the humming. When m3 and m4 of the database are
hummed as one note (HUM sequence A), the relative span $h_3 / h_2$ must be compared
with $(m_3 + m_4) / m_2$ instead of with $m_3 / m_2$. In the same manner, when m3 of the
database is hummed as two notes (HUM sequence B), $(h_3 + h_4) / h_2$ must correspond to
$m_3 / m_2$ of the database.

For relative spans, the actual span has to be decided according to the hypothesis
of split or merge. Therefore, Eq. (4) is changed as follows according to the DP paths.



\[ \begin{array}{l} t_1(i, j) = \left| \log \dfrac{m_t(i-1) + m_t(i)}{m_t(i-2)} - \log \dfrac{h_t(j)}{h_t(j-1)} \right| \\[2ex] t_2(i, j) = \left| \log \dfrac{m_t(i)}{m_t(i-1)} - \log \dfrac{h_t(j)}{h_t(j-1)} \right| \\[2ex] t_3(i, j) = \left| \log \dfrac{m_t(i)}{m_t(i-1)} - \log \dfrac{h_t(j-1) + h_t(j)}{h_t(j-2)} \right| \end{array} \tag{6} \]

Fig. 10. Relative span values conversion method when split or union occurs



Finally, the accumulation distance is modified as follows.

\[ g(i, j) = \min \left\{ \begin{array}{l} g(i-2, j-1) + d_1(i, j) \\ g(i-1, j-1) + d_2(i, j) \\ g(i-1, j-2) + 2d_3(i, j) \end{array} \right. \tag{7} \]

\[ d_e(i, j) = \gamma\, p_e(i, j) + (1 - \gamma)\, t_e(i, j) \tag{8} \]

In conventional DP matching, the value at a certain point d(i, j) is independent of
the DP paths and only depends on i and j. In the proposed Eq. (8), however, the values
at the same point, $d_e(i, j)$ with $e = 1, 2, 3$, change according to the DP paths. When
calculating the similarity measurement, this equation dynamically changes the notes
from which the relative pitch and the relative span are taken, depending on the DP path.
Moreover, at the time of matching, continuous DP (CDP) was extended to three
dimensions so that the proposed multiple pitch candidates can be treated.
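
The recursion of Eqs. (5) to (8) can be sketched in Python as follows. This is an illustration, not the authors' implementation, and it rests on several assumptions that are not taken from the text: the boundary condition lets the humming start at any database note at zero cost, the final distance is the minimum over all database end positions, the "static" mode uses the relative values of path 2 for every path, and the example data are hypothetical.

```python
import math

INF = float("inf")


def local_cost(path, i, j, m_p, m_t, h_p, h_t, gamma, dynamic):
    """d_e(i, j) of Eq. (8) for DP path e = 1, 2 or 3."""
    if not dynamic:
        path = 2  # static representation: always relative to the preceding note
    if path == 1:    # two database notes hummed as one note (union)
        p = abs((m_p[i] - m_p[i - 2]) - (h_p[j] - h_p[j - 1]))
        t = abs(math.log((m_t[i - 1] + m_t[i]) / m_t[i - 2])
                - math.log(h_t[j] / h_t[j - 1]))
    elif path == 2:  # one-to-one match
        p = abs((m_p[i] - m_p[i - 1]) - (h_p[j] - h_p[j - 1]))
        t = abs(math.log(m_t[i] / m_t[i - 1]) - math.log(h_t[j] / h_t[j - 1]))
    else:            # one database note hummed as two notes (split)
        p = abs((m_p[i] - m_p[i - 1]) - (h_p[j] - h_p[j - 2]))
        t = abs(math.log(m_t[i] / m_t[i - 1])
                - math.log((h_t[j - 1] + h_t[j]) / h_t[j - 2]))
    return gamma * p + (1.0 - gamma) * t


def dp_distance(m_p, m_t, h_p, h_t, gamma=0.5, dynamic=True):
    """Accumulated distance of Eq. (7); pitches in cents, spans in any time unit."""
    n_db, n_hum = len(m_p), len(h_p)
    g = [[INF] * n_hum for _ in range(n_db)]
    for i in range(n_db):
        g[i][0] = 0.0  # assumed boundary: the humming may start at any database note
    for i in range(1, n_db):
        for j in range(1, n_hum):
            args = (i, j, m_p, m_t, h_p, h_t, gamma, dynamic)
            options = [g[i - 1][j - 1] + local_cost(2, *args)]
            if i >= 2:
                options.append(g[i - 2][j - 1] + local_cost(1, *args))
            if j >= 2:
                options.append(g[i - 1][j - 2] + 2.0 * local_cost(3, *args))
            g[i][j] = min(options)
    return min(g[i][n_hum - 1] for i in range(n_db))


# Hypothetical data: the last database note is hummed as two notes, in another
# key (+300 cents) and at a different tempo.
m_p, m_t = [0, 200, 400, 700], [1.0, 1.0, 1.0, 2.0]
h_p, h_t = [300, 500, 700, 1000, 1000], [0.8, 0.8, 0.8, 0.8, 0.8]
dynamic_distance = dp_distance(m_p, m_t, h_p, h_t, dynamic=True)
static_distance = dp_distance(m_p, m_t, h_p, h_t, dynamic=False)
print(dynamic_distance, static_distance)
```

With the path-dependent costs the split note is absorbed without penalty, whereas the static representation is penalized for the spurious zero interval, in line with the argument of this section.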



6 Three-Dimensional Continuous DP Algorithm 



The features obtained from humming and the features of musical pieces in the data- 
base are matched using continuous DP. However, the query engine must be extended 
so that multiple pitch candidates along with confidence measures can be utilized. 
Therefore, the DP algorithm is extended into three dimensions as shown in Eq. (9). 
Here, g(i, j, k) is the accumulation distance of the value of the k-th pitch candidate for 
the j-th note of the humming and the i-th note of the musical piece. 



\[ g(i, j, k) = \min \left\{ \begin{array}{l} \min_{l} \{ g(i-2, j-1, l) + d_1(i, j, k, l) \} \\ \min_{l} \{ g(i-1, j-1, l) + d_2(i, j, k, l) \} \\ \min_{l} \{ g(i-1, j-2, l) + 2d_3(i, j, k, l) \} \end{array} \right. \tag{9} \]

\[ d_e(i, j, k, l) = \beta \{ \alpha\, p_e(i, j, k, l) + (1 - \alpha)\, c_e(j, k, l) \} + (1 - \beta)\, t_e(i, j) \tag{10} \]

The score $d_e$ ($e = 1, 2, 3$) is a weighted sum of $p_e(i, j, k, l)$, $c_e(j, k, l)$ and
$t_e(i, j)$: the first is the distance of pitch, the second is the confidence measure, and
the last is the distance of span. Here, i is the index of the note in the musical piece,
k is the index of the pitch candidate of the j-th humming note, and l is the index of the
pitch candidate of the preceding humming note (the (j-1)-th or (j-2)-th note, depending on
the DP path).

The factors $\alpha$ and $\beta$ can be varied to reflect the relative contribution of
pitch, confidence measure and span. When $\alpha = 1$ and $\beta = 1$, the weight
contribution is only based on pitch. On the other hand, if the $\beta$ value is zero, the
weight contribution is only based on span. Here, $p_e$, $c_e$ and $t_e$ are respectively
defined as follows.

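\[ \begin{array}{l} p_1(i, j, k, l) = \left| \{ m_p(i) - m_p(i-2) \} - \{ h_p(j, k) - h_p(j-1, l) \} \right| \\ p_2(i, j, k, l) = \left| \{ m_p(i) - m_p(i-1) \} - \{ h_p(j, k) - h_p(j-1, l) \} \right| \\ p_3(i, j, k, l) = \left| \{ m_p(i) - m_p(i-1) \} - \{ h_p(j, k) - h_p(j-2, l) \} \right| \end{array} \tag{11} \]

\[ \begin{array}{l} c_1(j, k, l) = \left| h_c(j, k) + h_c(j-1, l) \right| \\ c_2(j, k, l) = \left| h_c(j, k) + h_c(j-1, l) \right| \\ c_3(j, k, l) = \left| h_c(j, k) + h_c(j-2, l) \right| \end{array} \tag{12} \]

\[ \begin{array}{l} t_1(i, j) = \left| \log \dfrac{m_t(i-1) + m_t(i)}{m_t(i-2)} - \log \dfrac{h_t(j)}{h_t(j-1)} \right| \\[2ex] t_2(i, j) = \left| \log \dfrac{m_t(i)}{m_t(i-1)} - \log \dfrac{h_t(j)}{h_t(j-1)} \right| \\[2ex] t_3(i, j) = \left| \log \dfrac{m_t(i)}{m_t(i-1)} - \log \dfrac{h_t(j-1) + h_t(j)}{h_t(j-2)} \right| \end{array} \tag{13} \]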



where m(*) and h(*) are the database sequence and the humming sequence, respectively.
In Eq. (11), Eq. (12) and Eq. (13), $h_p$, $h_c$ and $h_t$ represent the multiple pitch
candidates, the confidence measure and the span value in the humming, and $m_p$ and $m_t$
are, respectively, the pitch and span values in the database.

Fig. 11 shows the mechanism of the matching algorithm extended to three dimensions.
The humming is matched with the database on the DP plane extended to three
dimensions. When considering matching of the humming to the database, the
proposed algorithm calculates the combination of all candidate points [19]. Finally,
the algorithm determines the optimal candidate points and paths, as shown in Fig. 11.
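
A compact Python sketch of the three-dimensional recursion of Eqs. (9) to (13) is given below. It is an illustration under assumptions, not the authors' implementation. In particular, Eq. (12) adds the $h_c$ terms to a distance that is minimized, so the sketch assumes that $h_c$ is supplied as a penalty (for instance, one minus the confidence ratio of Section 4.1); this interpretation is not stated in the text. The boundary condition (the humming may start at any database note with any candidate) and the example data are also hypothetical.

```python
import math

INF = float("inf")

# (database step, humming step, path weight) for DP paths e = 1, 2, 3.
PATHS = [(2, 1, 1.0), (1, 1, 1.0), (1, 2, 2.0)]


def span_cost(e, i, j, m_t, h_t):
    """t_e(i, j) of Eq. (13)."""
    if e == 1:
        db = (m_t[i - 1] + m_t[i]) / m_t[i - 2]
        hum = h_t[j] / h_t[j - 1]
    elif e == 2:
        db = m_t[i] / m_t[i - 1]
        hum = h_t[j] / h_t[j - 1]
    else:
        db = m_t[i] / m_t[i - 1]
        hum = (h_t[j - 1] + h_t[j]) / h_t[j - 2]
    return abs(math.log(db) - math.log(hum))


def dp3_distance(m_p, m_t, h_p, h_c, h_t, alpha=0.5, beta=0.5):
    """Accumulated distance of Eq. (9); h_p[j][k] and h_c[j][k] hold K candidates per note."""
    n_db, n_hum, n_cand = len(m_p), len(h_p), len(h_p[0])
    g = [[[INF] * n_cand for _ in range(n_hum)] for _ in range(n_db)]
    for i in range(n_db):
        for k in range(n_cand):
            g[i][0][k] = 0.0  # assumed boundary condition
    for i in range(1, n_db):
        for j in range(1, n_hum):
            for k in range(n_cand):
                best = INF
                for e, (di, dj, weight) in enumerate(PATHS, start=1):
                    if i - di < 0 or j - dj < 0:
                        continue
                    t = span_cost(e, i, j, m_t, h_t)
                    for l in range(n_cand):  # candidate of the preceding hummed note
                        p = abs((m_p[i] - m_p[i - di]) - (h_p[j][k] - h_p[j - dj][l]))
                        c = abs(h_c[j][k] + h_c[j - dj][l])
                        d = beta * (alpha * p + (1.0 - alpha) * c) + (1.0 - beta) * t
                        best = min(best, g[i - di][j - dj][l] + weight * d)
                g[i][j][k] = best
    return min(g[i][n_hum - 1][k] for i in range(n_db) for k in range(n_cand))


# Hypothetical data: the top candidate of the second hummed note is an octave error.
m_p, m_t = [0, 200, 400, 700], [1.0, 1.0, 1.0, 1.0]
h_t = [1.0, 1.0, 1.0, 1.0]
h_p = [[0, 1200], [1400, 200], [400, 1600], [700, -500]]
h_c = [[0.0, 0.2], [0.0, 0.2], [0.0, 0.2], [0.0, 0.2]]   # penalties, see lead-in
multi = dp3_distance(m_p, m_t, h_p, h_c, h_t)
single = dp3_distance(m_p, m_t, [[c[0]] for c in h_p], [[c[0]] for c in h_c], h_t)
print(multi, single)
```

With both candidates available, the recursion can pick the second candidate of the note affected by the octave error, so the distance to the correct melody stays small; with only the top candidate it grows considerably.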



7 Experiments and Results 

7.1 Experimental Conditions 

The music database consists of children’s songs, namely 155 pieces of music from
Japan and other countries. It was constructed from monophonic MIDI (Musical
Instrument Digital Interface) data containing the melody line. The pitches and the spans
were extracted from the monophonic MIDI. We also used a query corpus which consists of
a total of 320 queries hummed by 5 subjects. All subjects are inexperienced singers.
They used a headset microphone and started humming at an arbitrary portion of a song,
with free tonality and free tempo.

Fig. 11. Example of the matching flow using the three-dimensional continuous DP algorithm

Table 2 shows the experimental conditions. The average number of notes in one
hummed query was 9.8 and the average humming time was 4.6 seconds. In the
experiment, the weighting values ($\alpha$, $\beta$) were changed from 0 to 1 in steps of 0.1.



Table 2. Experimental conditions

Music Database       Children’s songs: 155 pieces of music from Japan and other countries
Test Data            Humming by 5 subjects
Sampling Frequency   16 kHz
Windows and Frame    64 ms Hamming window, 8 ms
Event Detection      Band pass filter: 600-1,500 Hz; differential filter: primary differential
Features             Multiple pitch candidates, confidence measure, span



7.2 Evaluation Measurement 

Evaluation measurement is an important factor in the evaluation of the performance
of a retrieval method. The effectiveness of a retrieval system is measured by its ability
to satisfy the user. Conventionally, such effectiveness is measured in terms of precision
and recall, and fallout and generality [5]. For different requests submitted to the
system, it is possible to plot precision against recall. However, the drawbacks of
measuring effectiveness in terms of just a pair of values have led to the development of
composite measures, which use the contingency tables but combine parts of them into
single-number measures. Several retrieval systems display all of the music pieces that
have the same score and count them as a correct answer. In this paper, however, when
retrieved musical pieces have the same score and their number exceeds the rank
concerned, candidates are chosen at random from the pieces with the same score,
according to Eq. (14). The performances of the systems are compared using ‘retrieval
accuracy’, which is the probability that the target song is included in the top-R outputs.
The retrieval accuracy A(R) was calculated as follows:



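\[ A(R) = \frac{1}{Q} \sum_{i=1}^{Q} T_i(R), \qquad T_i(R) = \left\{ \begin{array}{ll} 1 & \mbox{if } r(i) + n_i(r(i)) - 1 \le R, \\[1ex] \dfrac{R - r(i) + 1}{n_i(r(i))} & \mbox{else if } r(i) + n_i(r(i)) - 1 > R \mbox{ and } r(i) \le R, \\[1ex] 0 & \mbox{otherwise} \end{array} \right. \tag{14} \]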



where Q denotes the number of queries, $n_i(R)$ is the number of candidates that share
the score found at rank R for query i, and r(i) is the rank of the correct answer in the
retrieved results. For example, if the correct answer for a hummed tune is at rank 2
and three pieces have the same score at rank 2, the retrieval accuracy $T_i(R)$ changes
according to the rank R concerned. If three candidates have the same top score and one
of the candidates is the target, $n_i(1) = 3$ and $T_i(1) = 1/3$.
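
Eq. (14) can be checked against the example above with the following Python sketch. The function names are hypothetical, and each query is represented simply by the pair (r(i), n_i(r(i))), where r(i) is taken to be the rank at which the group of tied candidates containing the target begins.

```python
def query_score(rank, n_tied, top_r):
    """T_i(R) of Eq. (14): rank is r(i), n_tied is n_i(r(i)), top_r is R."""
    if rank + n_tied - 1 <= top_r:
        return 1.0                            # the whole tied group fits in the top R
    if rank <= top_r:
        return (top_r - rank + 1) / n_tied    # random choice among the tied pieces
    return 0.0


def retrieval_accuracy(results, top_r):
    """A(R): mean of T_i(R) over all queries; results holds (rank, n_tied) pairs."""
    return sum(query_score(rank, n_tied, top_r) for rank, n_tied in results) / len(results)


# Example from the text: three candidates share the top score, one is the target.
print(query_score(1, 3, 1))
# Hypothetical set of three queries evaluated at R = 1.
print(retrieval_accuracy([(1, 1), (1, 3), (5, 1)], 1))
```

For a target at rank 2 tied with two other pieces, the score rises from 0 at R = 1 to 1/3 at R = 2, 2/3 at R = 3 and 1 at R = 4, which illustrates how the accuracy depends on the rank concerned.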



7.3 Results 



In order to investigate the performance of the proposed algorithm, we carried out
music retrieval experiments. In the evaluation, a conventional music retrieval system
was implemented and its retrieval results were compared with those of the proposed
system.

The experimental results for music retrieval are shown in Table 3. Here, “Coarse-
to-Fine” and “Category 27” denote the methods in reference [3]. “Conventional”
stands for the static melody representation method described in Section 5.1, and
“Dynamic” stands for the dynamic melody representation method.



Table 3. Comparisons of the accuracy by various features (%)

Features               Melody representation   1st rank   Ranked in the top-10   Weight values
Span only              Conventional            40.9       67.3                   β=0.0
                       Proposed                42.6       71.2
Pitch only             Conventional            61.4       83.9                   α=1.0, β=1.0
                       Proposed                66.1       90.4
Span + Pitch           Conventional            70.8       86.7                   α=1.0, β=0.6
                       Proposed                73.9       91.1
Span + Pitch(3) + CM   Conventional            79.6       90.2                   α=0.5, β=0.1
                       Proposed                86.5       94.1
Coarse-to-Fine         Conventional            78.4       89.6                   γ=0.5
Category 27            Conventional            81.6       91.5                   γ=0.5

The Coarse-to-Fine matching method is used to effectively reduce the number of
answer candidates by gradually increasing the number of categories of relative values
used in the DP matching. For example, 3 categories of pitch represent the situation
where a note is above (up), below (down) or equal to (same) the previous note.
The conventional system utilized dynamic thresholds. Boundaries of the categories in
this system were tuned using all of the humming data. In the experiment, the
Coarse-to-Fine matching utilized three steps to increase the categories (3 → 9 → 27).
The penalty of the distance was calculated with Eq. (2).

From these results, the proposed melody representation method gave retrieval 
accuracies up to about 7 points higher than the conventional method (1.7 to 6.9 
points at the 1st rank and 3.9 to 6.5 points in the top 10). By using multiple pitch 
candidates, the retrieval accuracy at the 1st rank improved from 73.9% to 86.5%. The 
retrieval results using “Coarse-to-Fine” and “Category 27” were 78.4% and 81.6% at 
the 1st rank, respectively. Even compared with “Category 27”, the proposed method 
with “Span + Pitch(3) + CM” improved the retrieval accuracy by about 5 points at the 
1st rank (81.6% to 86.5%). Therefore, using multiple pitch candidates in the dynamic 
melody representation improved the accuracy, which confirms the validity of the 
approach. 
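
The accuracies above are presumably the percentage of queries whose correct song 
appears at the 1st rank or within the top 10. A minimal sketch of that measure 
follows; the function name `top_k_accuracy` and the rank list are made up for 
illustration and are not data from the experiments. 

```python
def top_k_accuracy(ranks, k):
    """Percentage of queries whose correct song is ranked within the top k.

    ranks: 1-based rank of the correct song for each query,
           or None if the correct song was not retrieved at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)

# Made-up ranks for eight queries, for illustration only.
ranks = [1, 3, 1, 12, None, 2, 1, 10]
top1 = top_k_accuracy(ranks, 1)     # 3 of 8 queries -> 37.5
top10 = top_k_accuracy(ranks, 10)   # 6 of 8 queries -> 75.0
```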

7.4 MIR System Implementation 

We built a GUI-based query-by-humming system. A complete set of tools and 
deliverable software was implemented, and experiments were conducted to evaluate the 
system performance and to explore other melody perception issues. 

The system was built on the MS Windows platform and developed using Microsoft 
Visual C++. It consists of five classes: CWaveBuffer for managing the voice 
input/output buffer, CWaveGraph for displaying the humming signal, 
CFeatureExtraction for extracting multiple pitch candidates and spans, CDBSearch for 
searching the database using the extracted information, and CMainDialogue for 
managing all of the dialogs. 

The implemented humming retrieval system, called MuseFinder, is shown in Fig. 12. 
The window is divided into three blocks: a sound information window, a feature 
information window, and a retrieval information window. The sound information window 
displays the query waveform. The feature information window shows a single pitch 
candidate and/or multiple pitch candidates with spans. The retrieval information 
window shows a ranked list of matching melodies. Each entry in the retrieval result 
gives the song name and the time position in the target tune where the match occurs. 
Moreover, the user can listen to the music by clicking on a hit in the results. 

Fig. 12. Prototype of MIR system (MuseFinder) 

8 Conclusions 

In this paper, we implemented a practical query-by-humming system that can find a 
piece of music in the database from a few hummed notes. Moreover, we proposed a 
novel error-tolerant melody matching method for retrieving music information in 
response to hummed queries. The user's hummed input differs from the ideal input for 
several reasons, including individual differences in tonality and tempo, and singing 
errors caused by ambiguous memory. To address these problems, the optimum neighbor 
note against which the relative value is taken is determined dynamically according 
to the DP paths. 
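
The small example below illustrates why relative values are used and why the choice 
of neighbor note matters. It is only a sketch: the helper names and the note 
sequences are hypothetical, and the reference note is picked by hand here rather 
than from DP paths. 

```python
def relative_pitches(notes):
    """Pitch of each note relative to the immediately preceding note (semitones)."""
    return [b - a for a, b in zip(notes, notes[1:])]

def relative_to(notes, i, ref):
    """Pitch of note i relative to an arbitrary earlier note ref (semitones)."""
    return notes[i] - notes[ref]

target = [60, 62, 64, 60]
# Hypothetical humming: sung 5 semitones higher, with an extra note (68) inserted.
hummed = [65, 67, 68, 69, 65]

static_target = relative_pitches(target)   # [2, 2, -4]
static_hummed = relative_pitches(hummed)   # [2, 1, 1, -4]

# Taking note 3 relative to note 1 skips the inserted note and
# recovers the target interval of 2 semitones.
skipped = relative_to(hummed, 3, 1)
```

The transposition disappears in the relative values, but the inserted note still 
distorts the intervals around it when the reference is always the previous note. 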

Furthermore, even if the hummed queries are perfect, it is still difficult to extract 
the pitch perfectly from them. To account for pitch extraction errors, we proposed 
the use of multiple pitch candidates. Using the proposed method, a pitch extraction 
accuracy of 99.7% was obtained within the third rank. Moreover, we proposed a 
similarity measurement algorithm that extends the search space of the DP plane into 
three dimensions so that matching is robust to pitch extraction errors in query 
processing. It is based on a continuous dynamic programming algorithm with features 
including spans and multiple pitch candidates along with their confidence measures. 
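
The following is a simplified sketch of a DP alignment in which the chosen pitch 
candidate forms a third dimension of the search space, alongside the query and 
target note indices. It is not the algorithm evaluated above: the penalty constants 
`GAP` and `CONF_WEIGHT`, the function name `match_cost`, and the local cost are 
assumptions made for illustration, spans are omitted, and both sequences are assumed 
to be non-empty. 

```python
import math

GAP = 2.0          # assumed penalty for an inserted or skipped note
CONF_WEIGHT = 1.0  # assumed weight of the confidence term

def match_cost(query, target):
    """Lowest alignment cost of a hummed query against any part of a target melody.

    query:  list of notes; each note is a list of (pitch, confidence) candidates
    target: list of pitches in semitones
    """
    n, m = len(query), len(target)
    # D[i][j][c]: best cost with query note i aligned to target note j,
    # using pitch candidate c for query note i.
    D = [[[math.inf] * len(query[i]) for _ in range(m)] for i in range(n)]

    for j in range(m):  # the match may start at any target note
        for c, (_, conf) in enumerate(query[0]):
            D[0][j][c] = CONF_WEIGHT * (1.0 - conf)

    for i in range(1, n):
        for j in range(m):
            for c, (pitch, conf) in enumerate(query[i]):
                best = math.inf
                for cp, (prev_pitch, _) in enumerate(query[i - 1]):
                    # Extra hummed note: the query advances, the target stays.
                    best = min(best, D[i - 1][j][cp] + GAP)
                    if j > 0:
                        # Both advance: compare hummed and target intervals.
                        diff = abs((pitch - prev_pitch) - (target[j] - target[j - 1]))
                        best = min(best, D[i - 1][j - 1][cp] + diff)
                D[i][j][c] = best + CONF_WEIGHT * (1.0 - conf)
            if j > 0:
                # Skipped target note: the target advances, the query stays.
                for c in range(len(query[i])):
                    D[i][j][c] = min(D[i][j][c], D[i][j - 1][c] + GAP)

    return min(min(D[n - 1][j]) for j in range(m))

target = [60, 62, 64, 65, 67, 65, 64]
# Hypothetical query: the top candidate of the second note is a pitch extraction
# error (80), while its second candidate (68) is the pitch actually hummed.
multi = [[(67, 0.9)], [(80, 0.6), (68, 0.4)], [(70, 0.9)]]
single = [[note[0]] for note in multi]   # keep only the top candidate per note
```

In this toy case `match_cost(multi, target)` is lower than 
`match_cost(single, target)`, because the second candidate lets the alignment follow 
the target intervals instead of paying for the extraction error. 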

We evaluated this system by measuring retrieval accuracy on a database of 155 
songs with a total of 320 queries. When using three pitch candidates with confidence 
measures and span features, the top-10 retrieval accuracy was 94.1% and the top-1 
retrieval accuracy was 86.5%. Both figures are higher than those of the conventional 
system (90.2% and 79.6% with the same features in Table 3), which demonstrates the 
advantage of the proposed methods. 